Tapjoy OpenStack Summit Paris Breakout Session

Tapjoy & OpenStack
Delivering Billions of Requests Daily
Wes Jossey
Head of Operations @Tapjoy

Tapjoy
● Global App-Tech Startup
● For Mobile Developers, We Power:
○ Monetization
○ Analytics
○ User Acquisition
○ User Retention
● 450M+ Monthly Users Across 270k+ Apps
● Worldwide Presence

Technical Details
● Early AWS Adopter.
● Grew Predominantly on AWS.
● Over 1,100 AWS VMs Daily (10/2014)
● Active Regions in Asia, Europe, N.A.
● Over One Trillion Requests Handled Annually
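
As a rough sanity check, the annual figure can be converted into daily and per-second averages. This is only a back-of-the-envelope sketch: it assumes exactly one trillion requests spread evenly over a 365-day year, which real traffic is not, and the function name is illustrative.

```python
# Back-of-the-envelope conversion of the annual request volume.
# Assumption: exactly 1 trillion requests, spread evenly over 365 days.
ANNUAL_REQUESTS = 1_000_000_000_000
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(annual_requests):
    """Return (requests per day, requests per second) averaged over a year."""
    per_day = annual_requests / DAYS_PER_YEAR
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


per_day, per_second = average_rates(ANNUAL_REQUESTS)
print(f"{per_day / 1e9:.2f} billion requests/day on average")
print(f"{per_second:,.0f} requests/second on average")
```

The daily average lands in the low billions, which is consistent with the "Billions of Requests Daily" in the title.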

Tech Philosophy
● Compute (EC2 & Nova) Driven Company
○ Operate Your Own Infrastructure
■ But Not Necessarily Built-From-Scratch
○ Zero Heart-Attack Nodes
■ All Nodes Are Ephemeral
■ Data is Always Distributed
■ Failure is Always Tolerated
■ Misbehaving Instances Are Terminated Quickly
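
The "terminate quickly" rule can be pictured as a small decision function. This is a minimal sketch, not Tapjoy's tooling: the names `InstanceHealth` and `record_check` and the threshold of three consecutive failed health checks are all hypothetical, and the actual termination call (EC2 or Nova) is left out.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Tracks consecutive failed health checks for one instance (hypothetical)."""
    instance_id: str
    consecutive_failures: int = 0


def record_check(health, healthy, max_failures=3):
    """Update the failure count; return True when the instance should be terminated.

    max_failures is an assumed threshold, not a value from the talk.
    """
    if healthy:
        # A passing check resets the streak.
        health.consecutive_failures = 0
        return False
    health.consecutive_failures += 1
    return health.consecutive_failures >= max_failures


node = InstanceHealth("i-example")
for result in (False, False, False):
    should_terminate = record_check(node, result)
print(node.instance_id, "terminate:", should_terminate)
```

Because every node is ephemeral and data is distributed, replacing a node is cheaper than debugging it in place.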

Services We Use
● SQS
○ Simple, Inexpensive, Durable.
○ Currently Building New Internal System Influenced by SQS, but with Different Guarantees
○ No Lock-In (See https://github.com/Tapjoy/chore)
● RDS
○ No Lock-In. Simple. Easy.
● CloudWatch (but also statsd)

Services We Use Cont.
● ELB
○ SSL Termination Only. Routing Handled Elsewhere.
● Auto-Scaling
○ Traffic can fluctuate 30% peak to valley
● S3
○ Where we store ALL the things
○ Still price competitive for what it provides. No plans to leave as of today.
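
To illustrate why a 30% swing makes auto-scaling worthwhile, here is a minimal sizing sketch. Every number in it is hypothetical (peak load, per-instance capacity, 20% headroom), and it reads "30% peak to valley" as the valley being 70% of the peak, which is one interpretation of the slide.

```python
def instances_needed(requests_per_second, capacity_per_instance, headroom_pct=20):
    """Instances required to serve a load with some spare headroom.

    Uses integer arithmetic (scaled by 100) so the ceiling division is exact.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    required = requests_per_second * (100 + headroom_pct)
    per_instance = capacity_per_instance * 100
    return max(1, -(-required // per_instance))  # ceiling division


# Hypothetical figures, not from the talk.
PEAK_RPS = 40_000
VALLEY_RPS = PEAK_RPS * 70 // 100   # assumed: valley is 30% below peak
CAPACITY_PER_INSTANCE = 50          # requests/second one instance can serve

peak_fleet = instances_needed(PEAK_RPS, CAPACITY_PER_INSTANCE)
valley_fleet = instances_needed(VALLEY_RPS, CAPACITY_PER_INSTANCE)
print(f"peak: {peak_fleet} instances, valley: {valley_fleet} instances")
print(f"instances released off-peak: {peak_fleet - valley_fleet}")
```

Under these assumptions, roughly 30% of the peak fleet can be released at the valley instead of running idle.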

Use Compute Everywhere
● Every Dev Has Access to Either AWS or Tapjoy-1 (Tapjoy’s OpenStack Deployment)
● Simulate Changes Against Useful Data
● Test Algorithms on Large Hadoop Clusters
● Practice for Failure With Access to Real Services (not mock endpoints)

Going Hybrid
● We Spend in the Millions on AWS
● Picked Data-Science Infrastructure because of Portability, and Ability to Leverage More Nodes
● Lower Risk than Tier-1 Production Services
● Wanted a Partner to Maintain OpenStack like Amazon ‘Maintains’ AWS
● We Want to Operate Apps

Vendors (It Matters)
● Metacloud
○ Verified our Design
○ Deployed OpenStack
○ Provisioned Network
○ Allowed Us to Focus on Business Applications
● Equinix
○ Cooling & Power Design
○ Remote Hands
○ Went Above and Beyond on Numerous Occasions

Vendors: Full List
● Metacloud
● Equinix
● Quanta
● Cumulus
● Level3
● Newegg

Challenges
● Hardware Delays Killed Our Timelines
○ Blew through our contingency windows.
○ Hurt our budgets.
○ Delayed subsequent purchases
● Setting Up IP Transit Can Be Slow
● No Physical Presence in DC
○ Also a Pro
● No Internal Previous Success Story… So Lots of Skepticism

The Not So Glamorous Job
● Negotiations Can Be Exhausting
● If You’re An Engineer, the Turnaround Time Can Be Frustrating
● You Probably Need a Gantt Chart
● There’s Nothing Agile About Writing a Big Check

Tapjoy-1: Data Nodes
348 ‘Data’ All Purpose Nodes
● Quanta S910-X31E: 12 Node Configuration
● Per Node
○ Intel Xeon E3-1265L v3 @ 2.5GHz
○ 4x1TB 7200RPM
○ 32GB RAM
○ Dual 1Gig NIC
● ‘Recyclable’ for Other Tasks if we Evolve

Tapjoy-1: Management Nodes
12 ‘Management’ Nodes
● Quanta S180: 4 Node Configuration
● Per Node
○ 2x Intel Xeon E5-2650 v2 @ 2.6GHz
○ 128GB RAM
○ 6x480GB SSD
○ Dual 10Gig NIC
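
The per-node specs on the Data Nodes and Management Nodes slides can be rolled up into cluster totals. This is a rough sketch: the `NodeTier` class is illustrative, "12 Node Configuration" and "4 Node Configuration" are read as nodes per chassis (an assumption), and storage is raw capacity before any replication or filesystem overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTier:
    """Per-node specs for one tier of Tapjoy-1, as listed on the slides."""
    name: str
    node_count: int
    nodes_per_chassis: int  # assumed meaning of "N Node Configuration"
    ram_gb: int
    disks: int
    disk_gb: int

    @property
    def chassis(self):
        # Ceiling division: a partially filled chassis still counts.
        return -(-self.node_count // self.nodes_per_chassis)

    @property
    def total_ram_gb(self):
        return self.node_count * self.ram_gb

    @property
    def raw_storage_tb(self):
        # Decimal terabytes, raw (no replication or formatting overhead).
        return self.node_count * self.disks * self.disk_gb / 1000


DATA = NodeTier("data", node_count=348, nodes_per_chassis=12,
                ram_gb=32, disks=4, disk_gb=1000)
MGMT = NodeTier("management", node_count=12, nodes_per_chassis=4,
                ram_gb=128, disks=6, disk_gb=480)

for tier in (DATA, MGMT):
    print(f"{tier.name}: {tier.chassis} chassis, "
          f"{tier.total_ram_gb} GB RAM, {tier.raw_storage_tb:.2f} TB raw storage")
```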

High-Level Request Flow Architecture

Plan For Failure
● Hardware
○ I’m Not Saying You Shouldn’t Use Ceph…
■ But You’ll Notice it’s Absent Here
● Service Boundaries
○ Have Hardware & Software Contingencies
■ Backup Links
■ Temporary Cache(s)
○ Actually Test Failure in Production

Info
● Twitter! @dustywes
● Email: wes@tapjoy.com
