The document discusses Tapjoy's use of OpenStack alongside AWS. Tapjoy is a global app-tech startup that powers monetization, analytics, user acquisition, and user retention for mobile developers. An early AWS adopter, the company was running over 1,100 AWS VMs daily by October 2014 and decided to build its own OpenStack deployment (Tapjoy-1) for additional compute capacity and flexibility. Key points include the partnerships with Metacloud and Equinix to deploy and manage Tapjoy-1, challenges around hardware delays and negotiations, and the plan to use both AWS and Tapjoy-1 flexibly based on application needs.
Tapjoy Delivers Billions of Requests Daily with OpenStack Hybrid Cloud
1. Tapjoy & OpenStack
Delivering Billions of
Requests Daily
Wes Jossey
Head of Operations @Tapjoy
2. Tapjoy
● Global App-Tech Startup
● For Mobile Developers, We Power:
○ Monetization
○ Analytics
○ User Acquisition
○ User Retention
● 450M+ Monthly Users Across 270k+ Apps
● Worldwide Presence
3. Technical Details
● Early AWS Adopter.
● Grew Predominantly on AWS.
● Over 1,100 AWS VMs Daily (10/2014)
● Active Regions in Asia, Europe, N.A.
● Over One Trillion Requests Handled
Annually
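
As a rough sanity check, the annual figure above can be converted into daily and per-second averages. This is a back-of-the-envelope sketch only: it assumes exactly one trillion requests spread evenly over a 365-day year, which the slide does not state (it says "over" one trillion, and real traffic is not flat).

```python
# Assumption: exactly 1e12 requests per year, evenly distributed.
# The slide only says "over one trillion", so these are lower-bound averages.
REQUESTS_PER_YEAR = 1_000_000_000_000
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(requests_per_year):
    """Return (requests per day, requests per second) as averages."""
    per_day = requests_per_year / DAYS_PER_YEAR
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


per_day, per_second = average_rates(REQUESTS_PER_YEAR)
print(f"{per_day / 1e9:.2f} billion requests/day")
print(f"{per_second:,.0f} requests/second on average")
```

Under these assumptions the average works out to a few billion requests per day, which is consistent with the "Billions of Requests Daily" in the talk title.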
4. Tech Philosophy
● Compute (EC2 & Nova) Driven Company
○ Operate Your Own Infrastructure
■ But Not Necessarily Built-From-Scratch
○ Zero Heart-Attack Nodes
■ All Nodes Are Ephemeral
■ Data is Always Distributed
■ Failure is Always Tolerated
■ Misbehaving Instances Are Terminated Quickly
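
The "terminate misbehaving instances quickly" rule above can be sketched as a simple selection policy. This is a hypothetical illustration, not Tapjoy's tooling: the `Instance` class, the `consecutive_failures` counter, and the threshold of 3 are all assumptions made for the example, and the actual termination call to EC2 or Nova is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical record; a real fleet would pull this from monitoring.
    instance_id: str
    consecutive_failures: int  # failed health checks in a row


MAX_FAILURES = 3  # assumed threshold, not taken from the slides


def select_for_termination(instances, max_failures=MAX_FAILURES):
    """Return the IDs of instances that have failed too many checks.

    Because every node is ephemeral and data is distributed, the policy
    never tries to repair a node; it only decides what to replace.
    """
    return [
        inst.instance_id
        for inst in instances
        if inst.consecutive_failures >= max_failures
    ]


fleet = [Instance("web-1", 0), Instance("web-2", 5), Instance("web-3", 3)]
doomed = select_for_termination(fleet)
print(doomed)
```

Keeping the policy free of any "is this node special?" branch is the point: it only works if no node is a heart-attack node.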
5. Services We Use
● SQS
○ Simple, Inexpensive, Durable.
○ Currently Building New Internal System Influenced
by SQS, but with Different Guarantees
○ No Lock-In (See https://github.com/Tapjoy/chore)
● RDS
○ No Lock-In. Simple. Easy.
● CloudWatch (but also statsd)
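
One common way to get the "no lock-in" property claimed for the queueing layer is to have job code depend on a small queue interface rather than on a specific service. The sketch below illustrates that idea only; it is not the API of chore (which is a Ruby project) or of SQS, and `JobQueue`, `InMemoryQueue`, and `drain` are hypothetical names.

```python
from abc import ABC, abstractmethod
from collections import deque


class JobQueue(ABC):
    """Hypothetical minimal interface that job code depends on."""

    @abstractmethod
    def publish(self, body):
        """Enqueue one message body."""

    @abstractmethod
    def consume(self):
        """Return the next message body, or None if the queue is empty."""


class InMemoryQueue(JobQueue):
    """Stand-in backend; an SQS-backed class would implement the same methods."""

    def __init__(self):
        self._items = deque()

    def publish(self, body):
        self._items.append(body)

    def consume(self):
        return self._items.popleft() if self._items else None


def drain(queue):
    """Process every pending job; knows nothing about the backend."""
    processed = []
    while True:
        body = queue.consume()
        if body is None:
            break
        processed.append(body.upper())  # placeholder for real work
    return processed


queue = InMemoryQueue()
queue.publish("resize")
queue.publish("email")
results = drain(queue)
print(results)
```

Swapping the backend then means writing one new class, not touching the jobs.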
6. Services We Use Cont.
● ELB
○ SSL Termination Only. Routing Handled Elsewhere.
● Auto-Scaling
○ Traffic can fluctuate 30% peak to valley
● S3
○ Where we store ALL the things
○ Still price competitive for what it provides. No plans
to leave as of today.
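
The auto-scaling bullet above implies that fleet size should track traffic. A minimal sizing sketch follows. It reads "30% peak to valley" as the valley being 30% below peak, which is an interpretation; the peak rate, the per-instance capacity, and the 20% headroom are invented numbers for illustration, not Tapjoy figures.

```python
def instances_needed(requests_per_second, capacity_per_instance, headroom_pct=20):
    """Instances required to serve the load with some spare headroom.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    numerator = requests_per_second * (100 + headroom_pct)
    denominator = 100 * capacity_per_instance
    return -(-numerator // denominator)  # ceiling division


PEAK_RPS = 40_000                  # hypothetical peak load
VALLEY_RPS = PEAK_RPS * 70 // 100  # assumed: valley is 30% below peak
CAPACITY_PER_INSTANCE = 100        # hypothetical requests/second per instance

peak_fleet = instances_needed(PEAK_RPS, CAPACITY_PER_INSTANCE)
valley_fleet = instances_needed(VALLEY_RPS, CAPACITY_PER_INSTANCE)
print(f"peak: {peak_fleet} instances, valley: {valley_fleet} instances")
print(f"instances released off-peak: {peak_fleet - valley_fleet}")
```

With these made-up inputs, scaling down in the valley frees roughly 30% of the fleet, which is the capacity a fixed-size deployment would leave idle.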
7. Use Compute Everywhere
● Every Dev Has Access to Either AWS or
Tapjoy-1 (Tapjoy’s OpenStack Deployment)
● Simulate Changes Against Useful Data
● Test Algorithms on Large Hadoop Clusters
● Practice for Failure With Access to Real
Services (not mock endpoints)
8. Going Hybrid
● We Spend in the Millions on AWS
● Picked Data-Science Infrastructure because
of Portability, and Ability to Leverage More
Nodes
● Lower Risk than Tier-1 Production Services
● Wanted a Partner to Maintain OpenStack
like Amazon ‘Maintains’ AWS
● We Want to Operate Apps
10. Vendors (It Matters)
● Metacloud
○ Verified our Design
○ Deployed OpenStack
○ Provisioned Network
○ Allowed Us to Focus on Business Applications
● Equinix
○ Cooling & Power Design
○ Remote Hands
○ Went Above and Beyond on Numerous Occasions
11. Vendors: Full List
● Metacloud
● Equinix
● Quanta
● Cumulus
● Level3
● Newegg
12. Challenges
● Hardware Delays Killed Our Timelines
○ Blew through our contingency windows.
○ Hurt our budgets.
○ Delayed subsequent purchases
● Setting Up IP Transit Can Be Slow
● No Physical Presence in DC
○ Also a Pro
● No Previous Internal Success Story… So
Lots of Skepticism
13. The Not So Glamorous Job
● Negotiations Can Be Exhausting
● If You’re an Engineer, the Turnaround Time
Can Be Frustrating
● You Probably Need a Gantt Chart
● There’s Nothing Agile About Writing a Big
Check
14. Tapjoy-1: Data Nodes
348 ‘Data’ All Purpose Nodes
● Quanta S910-X31E: 12 Node Configuration
● Per Node
○ Intel 1265Lv3 @ 2.5GHz
○ 4x1TB 7200RPM
○ 32GB RAM
○ Dual 1Gig NIC
● ‘Recyclable’ for Other Tasks if we Evolve
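
The cluster-wide totals implied by the per-node specs above can be tallied as follows. This is a minimal sketch: it assumes "12 Node Configuration" means 12 nodes per chassis, and it reports raw disk capacity only, before any replication or filesystem overhead.

```python
NODES = 348
NODES_PER_CHASSIS = 12   # assumption: reading of "12 Node Configuration"
DISKS_PER_NODE = 4
TB_PER_DISK = 1
RAM_GB_PER_NODE = 32

chassis = NODES // NODES_PER_CHASSIS
raw_disk_tb = NODES * DISKS_PER_NODE * TB_PER_DISK
total_ram_gb = NODES * RAM_GB_PER_NODE

print(f"chassis:   {chassis}")
print(f"raw disk:  {raw_disk_tb} TB")
print(f"total RAM: {total_ram_gb} GB")
```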
22. Plan For Failure
● Hardware
○ I’m Not Saying You Shouldn’t Use Ceph…
■ But You’ll Notice it’s Absent Here
● Service Boundaries
○ Have Hardware & Software Contingencies
■ Backup Links
■ Temporary Cache(s)
○ Actually Test Failure in Production
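
The "temporary cache" contingency at a service boundary can be sketched as a read path that falls back to the last known value when the primary dependency fails. This is an illustrative pattern, not Tapjoy's implementation: `fetch_with_fallback`, the plain `dict` cache, and the simulated `healthy`/`broken` backends are all hypothetical.

```python
def fetch_with_fallback(key, primary, cache):
    """Read through `primary`; serve the last cached value if it is down.

    `primary` is any callable that may raise ConnectionError.
    `cache` is a plain dict used as a temporary cache (no expiry here).
    """
    try:
        value = primary(key)
    except ConnectionError:
        if key in cache:
            return cache[key]  # stale but available
        raise                  # nothing cached: surface the failure
    cache[key] = value         # refresh the cache on every success
    return value


def healthy(key):
    """Simulated working backend."""
    return f"fresh:{key}"


def broken(key):
    """Simulated outage, e.g. the primary link is down."""
    raise ConnectionError("primary link down")


cache = {}
first = fetch_with_fallback("user-42", healthy, cache)
second = fetch_with_fallback("user-42", broken, cache)
print(first, second)
```

Swapping `healthy` for `broken` in a controlled way is a small-scale version of the last bullet: the fallback path is only trustworthy if it is exercised deliberately.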