Presented by Tom Maule, Software Solutions Architect, NowTV
Experience level: Advanced
NOW TV’s customer base has tripled year on year, and linear event streaming has become an ever greater percentage of their viewing. This talk will take you through the lessons we learnt following one unfortunate event in 2014 that led NOW TV to catastrophic failure, and the improvements and scalability challenges we faced when preparing for the same event just one year later. Hear how we turned our entire failure-case behaviour on its head: from always failing in favour of NOW TV, to failing in favour of the customer. Learn how high availability of video assets on the CDN ensures continued playout irrespective of the state of our application servers. Take away some tips on how NOT to scale, as well as how we reversed our fortunes and delivered stability during the biggest event on NOW TV to date.
MongoDB Days UK: NOW TV and Linear Streaming: Scaling MongoDB for High Load Events
1. NOW TV and Linear Streaming:
Scaling MongoDB for High Load Events
Tom Maule – NOW TV Solution Architect
2. Who am I?
• Tom Maule
– Solution Architect at NOW TV, Sky
– Previously Senior Java Developer on NOW TV Platform team (since project inception in early 2012)
– Also previously worked in the defence and telecoms industries
tom.maule@sky.uk
linkedin.com/in/tommaule
@tommaule
3. Abstract
• NOW TV Introduction
• Linear streaming challenges
• 7th April 2014
• Fixes and improvements
• 13th April 2015
• Future work and next steps
4. Introduction - Overview
• NOW TV is the online no-contract TV streaming service from Sky
• Available on over 60 devices including the award-winning NOW TV Box
• NOW TV offers movies and entertainment VOD and linear content, and, for the first time in the UK, pay-as-you-go Sports linear content
7. Introduction - NOW TV Architecture
[Architecture diagram: user devices reach the NOW TV Platform through load balancers for content metadata and user services, backed by MongoDB (content metadata and account data); video asset uploads and live video streams pass through VOD and linear transcoding and are uploaded to the CDN, which serves manifests and video chunks directly to devices; monitoring & alerting via Splunk (logs), New Relic, Icinga and MongoDB Cloud Manager.]
8. Video On Demand (VOD)
• Video content, available on demand, whenever users want it.
• Platform load is predictable – just ask Netflix, Amazon Instant Video, YouTube, etc.
9. Video On Demand (VOD)
• Even weekend load, though busier during the day, remains predictable
10. Linear Streaming
• Unlike other OTT (Over-the-Top) Providers, NOW TV offers streaming of live channels
• This is typically NOT predictable
• Load is driven by live events, not by time of day
[Graph: linear vs. VOD load profiles]
11. NOW TV and Linear Streaming: The unpredictable scalability challenge
Tom Maule – NOW TV Solution Architect
16. What happened?
• High load stressed our MongoDB instance
• Retries only compounded the problem
• Observed issues:
– Customers couldn’t start new streams
– Existing streams were terminated
– Concurrency errors during and shortly after the outage
– Very high read and write queues in MongoDB
– Viewing History APIs performed very slowly
– High proportion of time was spent updating indexes in MongoDB
17. Issues to Address
• Heartbeating resiliency
• Concurrency inaccuracies
• Products storage
• Viewing History
• Indexes in MongoDB
• MongoDB write lock
18. Heartbeating: Introduction
• After playout initiation, the actual video chunks are served by the CDN and don't touch our platform
• Lightweight heartbeats call back to our platform to notify us of continued playout every 10 mins
• NOW TV use heartbeats to:
– Enforce concurrency rules
– Enforce entitlement
– Record bookmark positions (VOD only)
[Diagram: the CDN serves video chunks to the user's device; lightweight heartbeats flow back to NOW TV at a 10 minute interval]
19. Heartbeating: Previously
• Previously, a non-OK heartbeat response would terminate playout on the user’s device
• Fail in favour of NOW TV
– When NOW TV platform is unavailable, existing playouts are terminated on next heartbeat.
[Diagram: a non-OK heartbeat response from the NOW TV platform terminates playout, even though the CDN is still serving video chunks]
20. Heartbeating: Today
• Today, playout continues unless a specific STOP heartbeat response is received
• Fail in favour of the customer
– Existing streams will NOT be terminated if NOW TV becomes unavailable
[Diagram: any non-STOP heartbeat response – including no response at all – lets the CDN continue serving video chunks]
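To make the fail-open rule concrete, here is a minimal client-side sketch (illustrative only – the heartbeat endpoint, response header and class names are assumptions, not NOW TV’s actual API):

import java.net.HttpURLConnection;
import java.net.URL;

public class HeartbeatClient {
    // Only an explicit STOP directive terminates playout; a platform outage
    // (timeout, 5xx, connection refused) no longer kills the stream.
    static boolean shouldContinuePlayout(String heartbeatUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(heartbeatUrl).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            String directive = conn.getHeaderField("X-Playout-Directive"); // hypothetical header
            return !"STOP".equals(directive); // fail in favour of the customer
        } catch (Exception e) {
            return true; // platform unreachable: keep playing until the next heartbeat
        }
    }
}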
21. 21
Heartbeating: Future
• Game of Thrones Linear customers produce ripple-effect heartbeating
– Because heartbeats are fixed to a 10 minute period, everyone who started streaming at the event start heartbeats together, every 10 minutes
• In future, we will randomise the first heartbeat period in an attempt to smooth out these ripples, as sketched below
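A sketch of that smoothing – jitter only the first interval so the steady state stays at 10 minutes (class and constant names are ours, not from the deck):

import java.util.concurrent.ThreadLocalRandom;

public class HeartbeatScheduler {
    private static final long INTERVAL_MS = 10 * 60 * 1000L;

    // First heartbeat fires at a random point within the first 10 minutes, so
    // clients that all joined at the start of an event don't beat in lockstep.
    static long firstDelayMs() {
        return ThreadLocalRandom.current().nextLong(INTERVAL_MS);
    }

    // Every subsequent heartbeat keeps the normal 10 minute period.
    static long nextDelayMs() {
        return INTERVAL_MS;
    }
}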
22. Concurrency: Introduction
• Concurrency of 2 streams is managed through the concept of Playout Slots
• A playout slot keeps track of a currently playing stream
• Slots are allocated on playout initiation
{
  "playouts": []
}
→ Play (ABC123) →
{
  "playouts": [
    { "id": "ABC123", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
→ Play (DEF456) →
{
  "playouts": [
    { "id": "ABC123", "heartbeat": "<timestamp>", "content": "<content_id>" },
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
23. Concurrency: Introduction
• Slots are updated on heartbeats to refresh the timestamp
• Expired slots are re-allocated on the next playout request
• Slots are terminated on an END event
{
  "playouts": [
    { "id": "ABC123", "heartbeat": "<timestamp>", "content": "<content_id>" },
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
→ END (ABC123) →
{
  "playouts": [
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
→ Play (CBF789) →
{
  "playouts": [
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>" },
    { "id": "CBF789", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
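In MongoDB terms, that lifecycle maps onto two single-document updates – a sketch with the Java driver, where the field names follow the example documents above and the collection and database names are assumed:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.util.Date;

public class PlayoutSlotLifecycle {
    public static void main(String[] args) {
        MongoCollection<Document> accounts = MongoClients.create("mongodb://localhost")
                .getDatabase("userdata").getCollection("accounts");
        String accountId = "account1", playoutId = "DEF456";

        // Heartbeat: refresh the timestamp on the matching slot (positional $ update).
        accounts.updateOne(
                Filters.and(Filters.eq("_id", accountId), Filters.eq("playouts.id", playoutId)),
                Updates.set("playouts.$.heartbeat", new Date()));

        // END event: release the slot by pulling it out of the array.
        accounts.updateOne(
                Filters.eq("_id", accountId),
                Updates.pull("playouts", new Document("id", playoutId)));
    }
}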
24. Concurrency: Previously
• Failure to receive an END event (due to an app crash or connectivity loss) blocked a slot until timeout
• Previously, this blocked subsequent playouts for up to 10 minutes
• “Concurrency limit reached” errors were seen after our service had been restored on GoT night
{
  "playouts": [
    { "id": "ABC123", "heartbeat": "<timestamp>", "content": "<content_id>" },
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>" }
  ]
}
→ Play → rejected: both slots still occupied
25. Concurrency: Today
• Now, slots allocated to the same Device ID can be ‘reclaimed’
• No more “Concurrency limit reached” errors following app crashes or service outages
{
  "playouts": [
    { "id": "ABC123", "heartbeat": "<timestamp>", "content": "<content_id>", "deviceId": "box1" },
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>", "deviceId": "box2" }
  ]
}
→ Play from box1 (reclaims box1’s slot) →
{
  "playouts": [
    { "id": "FCE987", "heartbeat": "<timestamp>", "content": "<content_id>", "deviceId": "box1" },
    { "id": "DEF456", "heartbeat": "<timestamp>", "content": "<content_id>", "deviceId": "box2" }
  ]
}
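A sketch of reclaim-then-allocate as two atomic single-document updates (Java driver; the 2-slot limit is expressed as “playouts.1 must not exist”, and all names beyond the example documents above are assumptions):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.util.Date;

public class SlotAllocator {
    static boolean allocate(MongoCollection<Document> accounts, String accountId,
                            String deviceId, String playoutId, String contentId) {
        // 1. Reclaim: if this device already holds a slot, reuse it in place.
        long reclaimed = accounts.updateOne(
                Filters.and(Filters.eq("_id", accountId),
                            Filters.eq("playouts.deviceId", deviceId)),
                Updates.combine(
                        Updates.set("playouts.$.id", playoutId),
                        Updates.set("playouts.$.content", contentId),
                        Updates.set("playouts.$.heartbeat", new Date())))
                .getModifiedCount();
        if (reclaimed > 0) return true;

        // 2. Allocate: push a new slot only if fewer than 2 are in use
        //    ("playouts.1" only exists once the array has two elements).
        long allocated = accounts.updateOne(
                Filters.and(Filters.eq("_id", accountId),
                            Filters.exists("playouts.1", false)),
                Updates.push("playouts", new Document("id", playoutId)
                        .append("heartbeat", new Date())
                        .append("content", contentId)
                        .append("deviceId", deviceId)))
                .getModifiedCount();
        return allocated > 0; // false => a genuine "concurrency limit reached"
    }
}

Expired slots can be cleared in the same spirit – a $pull on entries whose heartbeat is older than the timeout – before step 2 runs.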
26. Product Storage: Previously
• Every purchase and renewal of any product resulted in a new Product entity in MongoDB
Entertainment – June 2015
Movies – August 2015
Sports – 20th July 2015
Entertainment – July 2015
Entertainment – August 2015
Movies – September 2015
Entertainment – September 2015
Sports – 12th September 2015
Movies – October 2015
Entertainment – October 2015
Entertainment – November 2015
27. Product Storage: Today
• We store Entitlement entities instead of products, updating on renewals rather than duplicating
Entertainment – from June 2015
Sports – 20th July 2015
Movies – from August 2015
Sports – 12th September 2015
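With entitlements, a renewal becomes an upsert that touches one document per product instead of inserting a new one each month – a sketch, with field names assumed:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.util.Date;

public class Entitlements {
    // Purchase and renewal share one path: extend the entitlement in place,
    // and let upsert cover the first-ever purchase of this product.
    static void renew(MongoCollection<Document> entitlements, String accountId,
                      String product, Date newExpiry) {
        entitlements.updateOne(
                Filters.and(Filters.eq("accountId", accountId),
                            Filters.eq("product", product)),
                Updates.combine(
                        Updates.set("expires", newExpiry),
                        Updates.setOnInsert("since", new Date())),
                new UpdateOptions().upsert(true));
    }
}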
28. Viewings & Bookmarks: Introduction
• Viewing a VOD asset => Viewing
• Heartbeating during a VOD asset => Bookmark
• Viewings and Bookmarks were stored separately
• No capping or archiving
29. Viewings & Bookmarks: Previously
• Upon fetching a customer’s viewing history, multiple database queries were made:
- 1 query to the viewings collection to fetch n viewings for the customer
- n queries to the bookmarks collection to fetch the bookmark position for each viewing
- TOTAL: n + 1 MongoDB queries for a single request!
- Some customers had thousands of items in their viewing history!
Viewings:
{ "_id": "abc123", "accountId": "account1", "contentId": "movie1", "timestamp": "<timestamp>" }
{ "_id": "bcd345", "accountId": "account1", "contentId": "movie2", "timestamp": "<timestamp>" }
{ "_id": "cde456", "accountId": "account1", "contentId": "episode1", "timestamp": "<timestamp>" }
Bookmarks:
{ "_id": "fed987", "accountId": "account1", "contentId": "movie1", "position": 1187 }
{ "_id": "edc765", "accountId": "account1", "contentId": "movie2", "position": 2854 }
{ "_id": "dcb543", "accountId": "account1", "contentId": "episode1", "position": 3542 }
30. Viewings & Bookmarks: Today
• The original reason for keeping viewings and bookmarks separate was no longer apparent
• Now, viewings and bookmarks are merged
– Unnecessary document ID replaced with compound ID – improving indexing efficiency
– Shortened field names - reducing storage consumption and further improving indexing efficiency
Viewing:
{ "_id": "abc123", "accountId": "account1", "contentId": "movie1", "timestamp": "<timestamp>" }
Bookmark:
{ "_id": "fed987", "accountId": "account1", "contentId": "movie1", "position": 1187 }
View History (merged):
{
  "_id": { "accountId": "account1", "contentId": "movie1" },
  "position": 1187,
  "timestamp": "<timestamp>"
}
View History (shortened field names):
{
  "_id": { "aid": "account1", "cid": "movie1" },
  "pos": 1187,
  "ts": "<timestamp>"
}
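The merge collapses the n + 1 pattern into one upsert per heartbeat and one query per read – a sketch using the shortened field names above (collection name assumed):

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Date;

public class ViewHistory {
    // Write: one upsert records the bookmark position and the viewing together,
    // keyed by the compound _id { aid, cid }.
    static void recordPosition(MongoCollection<Document> history, String accountId,
                               String contentId, int position) {
        Bson key = Filters.eq("_id", new Document("aid", accountId).append("cid", contentId));
        history.updateOne(key,
                Updates.combine(Updates.set("pos", position), Updates.set("ts", new Date())),
                new UpdateOptions().upsert(true));
    }

    // Read: the whole viewing history, positions included, in a single query.
    static FindIterable<Document> fetch(MongoCollection<Document> history, String accountId) {
        return history.find(Filters.eq("_id.aid", accountId));
    }
}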
35. NOW TV Customer Base 2014 - 2015
• Our customer base TRIPLED, again, in the year up to April 2015
[Graph: customer base growth, 2013 - 2015]
36. NOW TV and Linear Streaming: The unpredictable scalability challenge
Tom Maule – NOW TV Solution Architect
37. What happened?
• Good platform availability throughout
• 2.5x the load that affected us just one year earlier
• Twice the normal concurrency for a typical Monday night
39. Recognition
• MongoDB Innovation Award 2015: recognises organisations that are creating ground-breaking applications. These projects represent the best and most innovative work in the industry over the last year.
• DTG Innovation Award 2015: recognises organisations which have driven innovation in a particular technology or sector.
40. What’s Next For NOW TV?
• Our growth is expected to continue along the same trajectory
• Moving to active-active datacentre architecture for increased resiliency
• Cloud-based ‘overflow’ scaling for high-load events
• Microservices
• Sub-system resiliency
41. Credits
• The entire NOW TV Technology team are credited with our success
– Platform Software Engineers
– Platform Quality Assurance Engineers
– Dev-Ops Engineers
– App Developers & Testers
– Analysts, scrum masters and management
• MongoDB Consultants and Technical
Services Engineers
• Be a part of our future success – work for NOW TV at Sky
– Sky’s Social Job Site (http://rfer.us/BSKEti5rp)
– @workforsky
Good morning & welcome to my talk
Challenges NOW TV face with linear streaming
and how we’ve scaled MongoDB to handle high load events
First a little introduction…
Introduce: product, customer base, architecture
Challenges we face with Linear streaming vs our competitors
7th April 2014: what and why
Fixes and improvements
In contrast, 13th April 2015
Future work, next steps, and how you can be part of it
NOW TV: online, no-contract, devices, incl. NOW TV Box.
Aggressive ambitions;
July 2012: Launch Movies (6M after project inception)
March 2013: Sports
July 2013: NOW TV Box
October 2013: Entertainment
Ambitions -> Extraordinary growth
Customer base tripling
Challenges:
focus on scalability
AND product roadmap
Unique position:
vod AND linear
both DRM,
concurrency
No other UK OTT Provider...
Unique challenge to scale platform
Simplistic view; 2 halves: NOW TV Platform & Stream transcoding and upload stack (S.T.U.S)
Platform: SSL, Cache, Security – Distribute load
Content discovery (catalogue; Groovy, JSON & XML REST APIs, cached, MongoDB)
User services (auth, purchase, playout; Groovy, JSON & XML REST APIs, MongoDB)
S.T.U.S: Upload, availability of video chunks, propagating through the CDN, to edge servers
Signin -> playout : load intensive on Platform, then chunks from CDN
INTRODUCE HEARTBEATS… but more on that later…
Now know architecture; look at what we do…
Video On Demand; not new…
VOD load driven by time of day – GRAPH (Typical week in May)
Even weekend load… <NEXT SLIDE>
Similar story: predictable
Allows for:
Accurate forecasting
Predictive auto-scaling – when you can predict what your load will be…
easy preparations
easy scaling
easy alerting
Unlike most other OTT, NOW TV live streaming
Less predictable; load driven by event not time of day <GRAPHS>
(Football match 4pm kickoff)
With this in mind, lets take a look at what happened on 7th April 2014… <NEXT SLIDE>
“Game of Thrones”: Biggest show to hit the small screen in UK, US and around the world
Popular fantasy drama -> NOW TV: s4 Premiere; 9pm 7th April 2014
Sorry to say, caught off-guard:
To 8:53pm: Normal
Next 4mins: Slow down, increased response times, isolated failures
By 8:57pm: Entire system failure, unresponsive, no new streams, old streams terminated
Customer sentiment quickly rippled across Twitter… <NEXT SLIDE>
Suffice to say, users not pleased
GoT, other Ents, Movies and Sports affected.
Credit customers - creative tweets!
One customer – creating memorabilia… <NEXT SLIDE>
That’s right, for some time you could actually purchase a Game of Thrones T-Shirt featuring…
Oh, and baby grow… <ANIMATE>
Head of Tech at the time wore his T-shirt every GoT night:
- goodluck charm or reminder of where we’ve come from?
It wasn’t just Twitter that picked up on our outage… <NEXT SLIDE>
Online news sites - incl. Digital Spy and CNet
High-profile news organisations – incl. BBC, Telegraph
HBO Go’s troubles… didn’t help visibility of ours…
Could nobody handle the popularity of Game of Thrones?
How did we miss this?
Linear load profile assumptions -> Wrong!
Let me show you… <GRAPH>
Typical sports linear; 3pm Kickoff, 5-6x load at peak
GoT; no pre-event buildup, 80% in 5m prior to event, load condensed
concurrency <INTRO CONCURRENCY> leaped 2.5x in <5mins
50% higher than any previous linear. 100x load at peak
High load stressed MongoDB; queues, timeouts
Retrying only compounded
No new streams. Already streaming = kicked off (despite video asset on CDN)
Concurrency errors (->10m after)
MongoDB Queues; read and write
Entitlement and Viewing History - slow
MongoDB indexes – bottleneck
Analysis; Logs, New Relic APM, MongoDB Cloud Manager -> issues of concern to address
Heartbeating not resilient to platform unavailability
Concurrency checking was liable to inaccuracies when playouts not successfully ended
Purchased products stored inefficiently in our database
Fetching Viewing History was very inefficient; high number of database queries per request
Updating indexes in MongoDB under load caused bottlenecks, and MongoDB’s write lock limited our write throughput
I briefly introduced the concept of heartbeating earlier… <RE-INTRODUCE HEARTBEATING>
So, previously, and at the time of the Game of Thrones outage… <ANIMATE>
This was very much failing in favour of NOW TV;
basically averting any risk of exposing our content to a user who is no longer entitled,
by being overly cautious when there were unknowns
Today, however… <ANIMATE>
In this way we fail in favour of the customer;
permitting playout to continue for a limited period
until we can determine that we can specifically instruct the device to terminate the stream.
Side effect of heartbeating; ripple-effect every 10 minutes… <DESCRIBE>
Doesn’t actually cause us any issues (yet),
anything we can do to smooth out our load is beneficial.
So, in future…
So that was Heartbeating. Lets take a look at Concurrency <NEXT SLIDE>
Concurrency is… <INTRODUCE CONCURRENCY>
NOW TV; managed through Playout Slots
stored in MongoDB
keep track of a currently playing stream.
Slots are allocated on playout initiation <ANIMATE>
Slots maintained by heartbeats;
updating the timestamp – preventing time out.
Slots released on stream termination;
END event
<ANIMATE>
Previously, no END event = no playout slot release until time out
<ANIMATE> (Crashed app example)
Subsequent playout blocked – user at their concurrency limit.
Inconsistent state: playout slots no longer accurately represented actual playouts
INTRODUCE “Playout Slot Reclaim”
+DeviceId to Slots… <ANIMATE>
No more “Concurrency limit reached” errors when an app crashed or following a service issue.
That was fly-by of concurrency across NOW TV, Lets take a look at Products and Entitlements
Previously – new DB entity on every purchase and renewal <ANIMATE>
Built for reporting & business intelligence reasons
Capture as much data as possible
These since moved out of platform
Definitely room for improvement… <NEXT SLIDE>
Today – entitlements instead of products.
No more ‘purchase history’ – just current view <ANIMATE>
Obvious gains; reduction in entities = less data = faster queries & cheaper storage
Another reduction we made in database entities was around Viewing History… <NEXT SLIDE>
Introduce ‘My TV’
Viewing VOD = viewing entity
Heartbeating VOD = bookmark entity
Legacy reasons – separate entities <EXPLAIN>
No cap or archiving – growing since launch – for reporting & business intelligence
So lets take a closer look at that… For ‘My TV’ view, multiple DB queries… <ANIMATE>
You may be looking at those n queries and questioning…
-> just one further query
= total of two queries
Yep, we did the same. It was an obvious inefficiency in our code
but we went one step further than that…
Requirement to support multiple users per account was no longer apparent = data merge… <ANIMATE> <EXPLAIN COMPOUND>
But we went one step further than that still;
Relational databases: column names are part of the table
Document-based databases like MongoDB: field name in every document = repetition
So we shortened the field names to save space while remaining readable… <ANIMATE>
+Archiving scheme: keep the dataset recent & relevant = keep collection size down
Now made efficiencies in terms of disk space… attention to Indexing… <NEXT SLIDE>
During GoT – MongoDB spent large proportion of time maintaining Indexes
MongoDB duplicates data when building indexes <EXPLAIN> <ANIMATE 1ST HALF>
Indexes > Dataset. Performs best when index fits in memory => Remove unnecessary indexes
<ANIMATE 2ND HALF> Compound index benefits
Entities 2->1 and indexes: 6 -> 2 = lower memory reqs and speeding up writes
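A sketch of what that consolidation can look like – one compound index serving both “all history for an account” and “history by recency” (the exact index shape is our assumption, not taken from the deck):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class HistoryIndexes {
    static void create(MongoCollection<Document> history) {
        // One compound index replaces separate per-field indexes: it answers
        // { "_id.aid": x } alone (prefix rule) and { "_id.aid": x } sorted by
        // ts descending, so fewer indexes need to fit in memory.
        history.createIndex(Indexes.compound(
                Indexes.ascending("_id.aid"),
                Indexes.descending("ts")));
    }
}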
One further inefficiency of MongoDB that we highlighted was Write Locks…
I imagine nearly everyone in the room is familiar with MongoDB <ANIMATE & EXPLAIN>
Now, previously, a write lock in MongoDB was global… <EXPLAIN>
because MongoDB was assuring consistent reads for us.
MongoDB 2.2: Write lock from Instance-level to Database-level
better, but still slowing us down - multiple collections per database…. <NEXT SLIDE>
So we split them out – most heavily used -> own database.
So now <EXPLAIN SAME EXAMPLE>
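The split itself is just addressing: each heavily written collection moves to its own database so writers contend for different database-level locks (a sketch; the database names are illustrative):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SplitDatabases {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost");
        // Before: every collection in one database shared one write lock.
        // After: bursts of playout writes no longer block viewing-history writes.
        MongoCollection<Document> playouts =
                client.getDatabase("playouts").getCollection("playouts");
        MongoCollection<Document> viewHistory =
                client.getDatabase("viewhistory").getCollection("viewhistory");
    }
}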
March this year: MongoDB 3.0: 2 storage engines
One offering collection-level locking (MMAPv1)
The other offering document-level locking (WiredTiger)
-> For NOW TV future
With all of these improvements under our belts
need to baseline platform capacity (for GoT S5 Premiere, April 2015)
We stress tested; blend of traffic profiles, incl GoT 2014 aggressive ramp-up <ANIMATE>
Achieved capacity of over 4x GoT 2014 load.
Fantastic milestone – confidence in improvements to our codebase
& in our deployment of MongoDB
Perf figures reassuring – year to April 2015; customer base TRIPLED again
<ANIMATE>
Despite this phenomenal growth, the forecasted load for Game of Thrones 2015 was comfortably within our platform capacity.
Game of Thrones Season 5 Premiere
Airing at 9pm on 13th April 2015,
The hit show returned to NOW TV; which was nearly as exciting for us as it was for our customers.
It was a real moment of truth for our platform that had come far over the last year.
So how did we do?...
I’m pleased to say we handled the load without a hitch.
Load 2.5x higher than we saw during Game of Thrones Season 4 Premiere in 2014
TWICE the normal Monday night concurrency
To help put into perspective… <EXPLAIN and ANIMATE>
Around 100x average weeknight load at peak
This time around the sentiment across Twitter was much nicer to read…
Customers acknowledged last year’s difficulties
actively congratulating us on turning around our fortunes
It was a wonderful sentiment that they had stuck with us through it all
We’ve been recognised for the work and Innovation we’ve done over the past year - TWO awards:
MongoDB Innovation Award
database performance improvements, some of which I’ve talked about today.
Digital TV Group
innovation across the whole of NOW TV; product offering, NOW TV box, apps and services.
So, where do we go from here?
Achieved a lot over the last 18 months, great GoT success, but not complacent – growth on same trajectory…
Active-active; spread load, lower latency, greater resiliency in case of datacentre or data link failure.
Cloud ‘overflow’ – not maintain 100% peak capacity for <1% of time. But how to trigger scaling up?
Microservices… - independent development, deployability and scalability
NOW TV depends on lots…. How to ensure end-to-end functionality if a dependency is unavailable – circuit breakers etc
All achievements not possible without all NOW TV Technology…
+credit to the Professional Services of MongoDB
+credit to whole NOW TV business; driving our growth and repairing the damage caused to our brand
+thanks to all our customers who stuck with us, we really appreciate it!
NOW TV’s complete turn of fortune is a real testament to our excellent engineering teams and our strong relationship with MongoDB.
NOW TV is growing and we need talented people to help drive our future success.
If interested in joining the team, see Sky’s Social Job Site or speak to me.
Thank you very much for listening,
It’s been a pleasure to tell you all about our service for the first time,
And I’ll be happy to take any questions you may have.
PREPARED QUESTION RESPONSES:
[Ticketmaster example?]
“seem like obvious mistakes” – Of course that’s easy to say in retrospect but at the time we were up against it… Requirements were incredibly fluid and we worked really hard to keep up with all the changes. Unfortunately this meant we lost sight of some performance metrics, and coupled with a misunderstanding / incorrect assumption around our expected Game of Thrones load led us to the catastrophic failure.
-> “easy in hindsight and we've improved not only our processes and our ways of working and forecasting to ensure were ready for the future”
“prioritising features/deadline over quality product” – TBC
“what does failing in favour of NOW TV mean?” – TBC
“alternatives to MongoDB” – Cassandra, Hadoop?