The Platform Data Technologies team has a broad mandate of enabling Netflix developers to
effortlessly build and manage system-to-system sharing of structured data. Our goal is to maximize
engineering velocity in the Idea→POC→PROD process.
We strive to minimize the development and operation costs of adopting our infrastructure and create
strong relationships with all Netflix engineering teams to better understand and address their
pain points and undiscovered common needs.
While improving the productivity of engineering teams is a major focus, we also want to continually
invest in raising the bar in our own team by hiring the best and enabling personal growth and
technology cross-learning while reducing the On-Call load.
Our team was initially formed to develop technologies for rapid, efficient, and reliable distribution of
Netflix’s video catalog to all mid-tier systems.
Innovations in this area proved to be horizontal in nature, such as Versioned Pub/Sub and Netflix
Open Source Hollow. In turn, interactions with our diverse teams helped identify highly leverageable
possibilities that led us to invest in our Data-as-a-Service Framework and Unified Logging.
We provide the infrastructure to aggregate a massive amount of video metadata from many
studio-facing systems, create a unified API, and deliver round-the-clock updates to all Netflix services
in a highly optimized and timely manner.
Since this data affects every aspect of a user’s experience, from user interface to personalized
recommendations, streaming, A/B testing, and even content caching at ISP locations, it’s critical that
our system is highly available, accommodates continuous data updates and yet minimizes and
mitigates the risk of bad data affecting downstream systems.
Data-as-a-Service (DaaS) Framework
Many teams create handcrafted systems to publish data to other systems, but this always involves a
nontrivial amount of time and effort to harden, operationalize, and evolve.
Our goal was to enable creation of an end-to-end, fully operationalized DaaS system in about an hour!
That’s as much energy as we want teams to spend dealing with the mechanics of reliably moving
large structured data between systems. Our infrastructure generates the client APIs, dashboards,
metrics, and logging; it also simplifies adding validation rules and circuit breakers, and more, so every team
doesn’t have to! The technologies and learnings from the Netflix Catalog system can now be applied
to any dataset and increase the velocity and reliability of all DaaS systems at Netflix.
Large-payload Versioned Pub/Sub
Everyone is familiar with common Pub/Sub mechanisms like Kafka that publish events to downstream
consumers. However, publishing a large and changing dataset to multiple cloud regions usually requires
you to roll your own solution.
We observed this pattern in many Netflix services and implemented a common infrastructure that
does the heavy lifting and exposes a simple and familiar Pub/Sub model. This widely adopted
solution also provides consumer startup resiliency via fallback mechanisms in case of unavailable
datasets that a system depends on.
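The model described above can be illustrated with a minimal sketch. It is not the team's actual implementation: the names `VersionedStore` and `Consumer` are hypothetical, and an in-memory dictionary stands in for whatever cross-region storage the real infrastructure uses. The publisher writes a full snapshot under a version number and then announces it; at startup the consumer tries the announced version and falls back to older versions if that snapshot cannot be loaded.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Hypothetical in-memory stand-in for a store of versioned snapshots."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, dict] = {}
        self.announced: Optional[int] = None  # version consumers should prefer

    def publish(self, version: int, snapshot: dict) -> None:
        # Write the snapshot first, then announce it, so an announced
        # version always refers to data that was fully written.
        self._snapshots[version] = snapshot
        self.announced = version

    def fetch(self, version: int) -> dict:
        # Raises KeyError when the snapshot is missing (simulated outage).
        return self._snapshots[version]

    def drop(self, version: int) -> None:
        # Helper for the sketch: make a snapshot unavailable.
        del self._snapshots[version]

    def versions(self) -> List[int]:
        return sorted(self._snapshots)


class Consumer:
    """Loads the announced version at startup, falling back to older ones."""

    def __init__(self, store: VersionedStore) -> None:
        self.store = store
        self.version: Optional[int] = None
        self.data: Optional[dict] = None

    def start(self) -> None:
        announced = self.store.announced
        if announced is None:
            raise RuntimeError("nothing has been published yet")
        # Try the announced version first, then older versions, newest first.
        older = [v for v in reversed(self.store.versions()) if v < announced]
        for candidate in [announced] + older:
            try:
                self.data = self.store.fetch(candidate)
            except KeyError:
                continue  # snapshot unavailable, try the next candidate
            self.version = candidate
            return
        raise RuntimeError("no usable dataset version available")
```

In this sketch the fallback only happens at startup; a real system would also need to decide how a running consumer reacts to newly announced versions, which the text above does not specify.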
Unified Logging
Netflix systems create a substantial volume of logs across multiple tech stacks that are needed to
help diagnose production systems. Over the years many solutions were used across the company
with varying degrees of universality, developer-friendliness, query latency, and cost characteristics.
After examining the ecosystem and speaking with many engineering teams, it became clear that they
would benefit greatly from a single mechanism that’s easy to adopt, scalable, cost-effective, tunable
for a team’s specific needs, extensible, and integrated into Netflix’s cloud infrastructure. We’re still in
the early stages of this project but plan to leverage proven open source and Netflix technologies for a
great developer experience that minimizes the mean-time-to-diagnosis of problems and delivers
Some of the challenges
Small hiccup → Huge impact
With over 125M global subscribers and continuously increasing viewing hours, even short-duration
mishaps create a bad user experience and increase the On-Call load. As a Platform team, we want
our infrastructure to help teams minimize the frequency and blast radius of common issues.
We all use techniques to prevent bad code from affecting system availability, but less is usually done to
systemically prevent bad data from poisoning the ecosystem. Unvetted data changes are at least as risky
as code changes so our team continuously invests in building and improving technologies and
best practices that avoid many customer-impacting issues. We routinely incorporate these
innovations into our infrastructure, make them simple to adopt, and champion this critical aspect of
system development throughout the company.
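One way to picture a data circuit breaker is as a validation step that compares a candidate dataset with the last known-good one and refuses to publish when the change looks suspicious. The sketch below is an assumption-laden illustration, not the team's actual rules: the record-count rule, the 10% default threshold, and the names `RecordCountBreaker` and `publish_if_valid` are all hypothetical.

```python
from collections.abc import Mapping
from dataclasses import dataclass


class CircuitBreakerTripped(Exception):
    """Raised when a candidate dataset fails validation."""


@dataclass(frozen=True)
class RecordCountBreaker:
    # Hypothetical rule: trip when the record count shrinks by more
    # than this fraction relative to the previous dataset.
    max_drop_fraction: float = 0.1

    def check(self, previous: Mapping, candidate: Mapping) -> None:
        if not previous:
            return  # first publish: nothing to compare against
        drop = (len(previous) - len(candidate)) / len(previous)
        if drop > self.max_drop_fraction:
            raise CircuitBreakerTripped(
                f"record count dropped by {drop:.0%}, "
                f"limit is {self.max_drop_fraction:.0%}"
            )


def publish_if_valid(previous: Mapping, candidate: Mapping,
                     breaker: RecordCountBreaker) -> Mapping:
    """Return the dataset consumers should see after this publish attempt."""
    try:
        breaker.check(previous, candidate)
    except CircuitBreakerTripped:
        # Keep serving the last known-good data instead of the bad update.
        return previous
    return candidate
```

The design choice worth noting is that a tripped breaker leaves consumers on the previous data rather than failing them, which trades freshness for availability; a production system would presumably also alert the data owners.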
There’s a general pattern in many dataflows:
These flows frequently cross dissimilar technology stacks and engineering teams with various data
hygiene, correctness, and backward-compatibility practices.
While data used to primarily flow from mid-tier systems to each other and to the data warehouse for
analytics purposes, it now increasingly flows back from the data warehouse to mid-tier systems as
well to provide a powerful dynamic feedback loop.
Our team will need to embrace this macro trend and help increase the velocity of building,
operationalizing, and improving availability of all data-heavy systems at Netflix regardless of their
Invest, Innovate, Iterate
The continual evolution of Netflix’s business, increasing scale, and proliferation of data-centric
services drive us to keep improving and expanding our infrastructure portfolio to stay ahead of the
Our technologies are at various maturity levels and need different types and amounts of investment.
We must investigate promising open source projects for our tech stack, optimize the operational
needs of more mature systems through automation, re-architect to increase scalability, streamline
APIs and harden newer systems, validate and iterate on MVP feature sets, help rapidly migrate
teams from older infrastructure or custom implementations to new, generalized, and more robust
solutions, improve usability of our tools, and much, much more.
With so many possibilities, we must be very judicious in making the right level of investments in each
of our technologies through thoughtful evaluation, debating priorities, gathering feedback from other
teams, and applying good judgement.
Our guiding principle in making these decisions is simply “What’s the best thing for Netflix?”