Twitter Streaming API Architecture
by jkalucki on Apr 27, 2010
Follow a Tweet from creation to timeline and Streaming API delivery. The design of streaming within Twitter is influenced by the entire Twitter architecture, the direction of the platform, data syndication policies and Quality of Service requirements. We'll discuss these influences and our system implementation.
What are our constraints?
We have four big goals for Streaming.
First, we want users to have a low latency experience.
Instant feels like the right speed for Twitter. Not 18 seconds later. More or less right now.
Second, every write into Twitter is an event that someone, somewhere might be interested in.
We want to expose more event types than just new Tweets.
Sometimes you just need everything. And you also need a place to put it.
And finally, we need to make Large Scale integrations with other services as easy as possible.
You shouldn’t have to wrestle with parallel fetching, rate limiting, and all that.
It should be easier for all developers to get data out of Twitter.
for large scale integrations, or for
exposing more and more event types.
The REST model may be great for many things, but for real-time Twitter
where you just want to know what’s changed
we’ve already pushed Request Response too far.
It’s painful.
Or a million users. Or ten million. Impossible.
As Twitter adds more features, this just gets worse.
It’s just not practical to lift rate limits high enough to meet everyone’s goals.
The real-time REST model is near the point of collapse.
A lot of effort goes into responding to each API request.
There’s a lot to do, a lot of data to gather, and
none of it is on that front end box handling the request.
To make matters more difficult, the cost and latency distributions are very wide --
from a cheap cache hit to a deep database crawl.
Keeping latency low is a struggle.
It needs some controls, especially if rate limits are removed.
And we will still need to preserve all of our policies around
abuse, privacy, terms of use, and so forth.
Everyone has to play by the same rules and it must be
possible for everyone to have a chance at building a sustainable business.
We’ve come to some win win decisions about the firehose and other elevated access levels.
I think we can make nearly everyone very happy.
Go to the Corp Dev Office hours at 2:30pm for more detail about our Commercial Data Licenses.
Keep this in mind: solving these policy issues is a requirement, just as much as solving the technology issues is.
We’ve already proven that we can offer low latency streams of all Twitter events.
We’ve been streaming these events to ourselves for quite some time.
Twitter Analytics, for example, takes various private streams to feed experimental and production features.
Pleasant.
So, how does Streaming work?
We gather interesting events everywhere in the Twitter system and
apply those events to each Streaming server.
Inside the server,
we examine the event just once, and
route to all interested clients.
It has turned out to be practical, stable and very efficient.
Little effort is wasted.
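Here is a minimal sketch of the "examine once, route to all" idea, assuming a simple keyword match. The Router class, its method names, and the event shape are all hypothetical and are not Hosebird's actual code.

```python
from collections import defaultdict


class Router:
    """Hypothetical router: look at each event once, copy it to every interested client."""

    def __init__(self):
        self.clients_by_keyword = defaultdict(set)  # keyword -> client ids
        self.outboxes = defaultdict(list)           # client id -> delivered events

    def subscribe(self, client_id, keyword):
        self.clients_by_keyword[keyword.lower()].add(client_id)

    def route(self, event):
        # Tokenize the text a single time, however many clients are connected.
        words = set(event["text"].lower().split())
        interested = set()
        for word in words:
            interested |= self.clients_by_keyword.get(word, set())
        # Each interested client gets exactly one copy of the event.
        for client_id in interested:
            self.outboxes[client_id].append(event)
        return interested
```

The work per event depends on the words in the event, not on how many clients are polling, which is the contrast with answering every request separately.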
Yes, we look at each event on each of our streaming servers,
but that’s really nothing compared to processing billions of requests
only to say: sorry no new tweets yet.
Since each event is delivered only once, there’s no bandwidth wasted.
Latency is very low too. More on that later.
We can add new event types to streams without having everyone recode to hit new endpoints.
Just like adding new fields to JSON markup is future proof,
we can also easily add new events to existing streams.
When you are ready to use the new events, you can,
otherwise, ignore them.
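On the client side, that future proofing can look something like the sketch below. The "event" field name and the default of "tweet" are assumptions made for illustration, not the real wire format.

```python
import json


def handle_line(line, handlers):
    """Dispatch one JSON message; quietly skip event types we do not handle yet."""
    message = json.loads(line)
    # Assumed convention: an "event" field names the type; plain tweets have none.
    kind = message.get("event", "tweet")
    handler = handlers.get(kind)
    if handler is None:
        return None  # unknown event type: ignore it rather than fail
    return handler(message)
```

A client written this way keeps working when a new event type shows up on a stream it is already reading.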
Well, it's a continuous stream of discrete JSON or XML messages.
We deliver events at least once and in roughly sorted order.
In general, during steady state, you’ll see each event exactly once with a practical K sorting.
I’ll talk more about how these properties affect you at my other talk.
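A rough sketch of what a client can do with those two properties: drop duplicates, and hold back a small window of events so they come out in order. It assumes each event carries a sortable id and that no event arrives more than k positions late; both are assumptions for illustration, as is the function name.

```python
import heapq


def reorder(events, k):
    """Yield (id, payload) pairs in id order, dropping repeated ids.

    Assumes no event arrives more than k positions out of order.
    The seen set grows without bound here; a real client would trim it.
    """
    heap, seen = [], set()
    for event_id, payload in events:
        if event_id in seen:
            continue  # at-least-once delivery: skip the duplicate
        seen.add(event_id)
        heapq.heappush(heap, (event_id, payload))
        if len(heap) > k:
            yield heapq.heappop(heap)  # smallest id is now safe to emit
    while heap:
        yield heapq.heappop(heap)
```

The cost is a little added latency, since an event waits until k newer ones have arrived behind it or the stream ends.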
The data isn’t always display ready or even display worthy --
you need to post-process the Streaming API.
Also, the Streaming API servers don’t do much markup rendering -- that happens upstream in Ruby daemons -- so whatever rendering quirks you are used to on the REST API, well, they’ll be here too.
At least it’s always the same quirks.
It’s all a downstream model.
Users do things, stuff happens, and we route a copy to Streaming.
Let’s look at how we handle a common event: the creation of a new tweet.
They ack the user, then drop a message into a Kestrel message queue for offline processing.
This way we can give user feedback, yet defer the heavy lifting to our event driven architecture.
The tweets are fanned out to internal services: search, streaming, Facebook, mobile, lists, and timelines.
As an example, timeline processing daemons read the event, serially look up all the followers in Flock and re-enqueue large batches of work.
Even before this flock lookup completes, another timeline daemon pool reads these batches then
updates the memcache timeline vector of all the followers in a massively parallel fashion.
The other services do their own thing, and the tweet is eventually published everywhere.
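The timeline path above can be sketched in a few lines. Everything here is a stand-in: the deques play the part of Kestrel queues, the FOLLOWERS dict plays Flock, and the timelines dict plays the memcache timeline vectors. The function names and batch size are made up for illustration.

```python
from collections import defaultdict, deque

FOLLOWERS = {"alice": ["bob", "carol", "dave"]}  # stand-in for Flock
tweet_queue = deque()           # stand-in for the queue fed by the front end
batch_queue = deque()           # stand-in for the re-enqueued follower batches
timelines = defaultdict(list)   # stand-in for memcache timeline vectors


def create_tweet(author, text):
    """Front end: enqueue the tweet for offline processing and ack right away."""
    tweet_queue.append((author, text))
    return "ack"


def lookup_daemon(batch_size=2):
    """Read one tweet, look up its followers, re-enqueue them in batches."""
    author, text = tweet_queue.popleft()
    followers = FOLLOWERS.get(author, [])
    for start in range(0, len(followers), batch_size):
        batch_queue.append((followers[start:start + batch_size], text))


def update_daemon():
    """Read one batch and append the tweet to each follower's timeline."""
    followers, text = batch_queue.popleft()
    for follower in followers:
        timelines[follower].append(text)
```

This sketch runs the two daemons one after the other. In the pipeline described above they are separate pools, so timeline updates can start before the follower lookup has finished.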
Hosebird is the name of the Streaming server implementation.
I really don’t like it. But the name stuck.
Anyway, we use kestrel fanout queues to present each event to each fanout Hosebird process.
Fanout queues duplicate each message for each known reader.
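As a toy model of that behavior, assuming nothing about Kestrel's real interface, a fanout queue is just one private queue per registered reader. The class and method names below are hypothetical.

```python
from collections import deque


class FanoutQueue:
    """Toy fanout queue: every put is duplicated into each known reader's queue."""

    def __init__(self):
        self.readers = {}  # reader name -> that reader's private queue

    def add_reader(self, name):
        self.readers[name] = deque()

    def put(self, message):
        # One copy per known reader, so no reader can consume another's message.
        for reader_queue in self.readers.values():
            reader_queue.append(message)

    def get(self, name):
        reader_queue = self.readers[name]
        return reader_queue.popleft() if reader_queue else None
```

Storage and write work grow with the number of readers, which is one way to see why the queues are not free.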
Kestrel queues are bomb-proof and relatively inexpensive, but they aren’t free.
Cascading is where a hosebird process reads from a peer via streaming HTTP, just like any other streaming client.
No coordination is needed and we’re eating our own dogfood.
There’s hardly any latency added by cascading, but the cost savings are considerable when there’s a large cluster of hosebird machines.
Also, we get rack locality of bandwidth, as the hosebirds are generally together in a rack,
while the kestrels are located in another aisle.
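In spirit, cascading looks like the sketch below: the cascaded process is just one more client of its peer. The Node class and the callback style are assumptions for illustration; the real thing runs over streaming HTTP.

```python
class Node:
    """Hypothetical streaming process: pushes each event to its connected clients."""

    def __init__(self, name):
        self.name = name
        self.clients = []  # callables that accept one event

    def connect(self, deliver):
        self.clients.append(deliver)

    def on_event(self, event):
        for deliver in self.clients:
            deliver(event)


# Only the fanout node reads from the queue; the cascaded node
# connects to it like any other streaming client would.
fanout_node = Node("fanout")
cascaded_node = Node("cascaded")
fanout_node.connect(cascaded_node.on_event)
```

Only the fanout node needs its own copy of each message in the queue, and clients attached to the cascaded node still see every event.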
How do the servers work internally?
Hosebird runs on the JVM.
It’s written in Scala.
And uses an embedded Jetty webserver to handle the front end issues.
We feed each process 8 cores and about 12 gigs of memory.
And they each can send a lot of data to many, many clients.
Filtered events are sent through a Java queue
then read by the connection thread which handles the socket writing details.
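A small sketch of that hand-off, using Python's queue and threading modules in place of the Java ones. The Connection class, the list standing in for the socket, and the choice to drop events when the queue is full are all assumptions for illustration.

```python
import queue
import threading


class Connection:
    """Hypothetical client connection: a bounded queue drained by its own thread."""

    def __init__(self, maxsize=1000):
        self.outbound = queue.Queue(maxsize=maxsize)
        self.written = []  # stand-in for bytes written to the socket
        self.thread = threading.Thread(target=self._write_loop, daemon=True)
        self.thread.start()

    def offer(self, event):
        """Called by the filtering side; never blocks on a slow client."""
        try:
            self.outbound.put_nowait(event)
            return True
        except queue.Full:
            return False  # assumed policy: drop when the client falls behind

    def close(self):
        self.outbound.put(None)  # sentinel tells the writer thread to stop
        self.thread.join()

    def _write_loop(self):
        while True:
            event = self.outbound.get()
            if event is None:
                return
            self.written.append(event)
```

The queue decouples filtering from socket writes, so one slow client cannot stall event processing for everyone else.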
We use the Grabby Hands kestrel client to provide
highly parallel and low latency
blocking transactional reads from Kestrel.
We use our own Streaming client in the cascading case.
Both fetching clients are very efficient and hardly use any CPU.
We use Actors to prevent a lot of worrying about concurrency issues.
It’s not a panacea but it has made much of this work trivial.
Actors currently fall down if you have too many of them,
so we use the Java concurrency model to host the connections.
Otherwise it's all Actors.
The year 1997 is calling to mock me, I’m sure. But so far it hasn’t mattered.
The memory utilization isn’t a limiting factor, and it keeps things very simple.
Feeds keep a circular buffer of recent events to support the count parameter and some historical look back.
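A circular buffer like this is only a few lines. The sketch below uses a bounded deque; the class name, capacity, and backfill method are hypothetical, and it only shows how a count of recent events could be served.

```python
from collections import deque


class RecentEvents:
    """Toy feed buffer: keeps only the newest `capacity` events."""

    def __init__(self, capacity):
        self.recent = deque(maxlen=capacity)  # oldest entries fall off automatically

    def publish(self, event):
        self.recent.append(event)

    def backfill(self, count):
        """Return up to `count` of the most recent events, oldest first."""
        if count <= 0:
            return []
        return list(self.recent)[-count:]
```

A new connection that asks for some look back would get backfill(count) first and live events after that.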
I had to parallelize the JSON and XML parsing,
which turned out to be the big CPU burn and probably our major tweets per second scaling risk.
Arbitrary composition in conf files is a pretty powerful concept.
So, to create user streams, I just had to forward events from all these other existing feeds
into the User feed and write some custom delivery logic.
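To make the composition idea concrete, here is a sketch where a plain dict stands in for the conf file. The feed names, the ComposableFeed class, and the build function are all made up for illustration; the real conf format is not shown in this talk.

```python
class ComposableFeed:
    """Hypothetical feed: records events and forwards them to downstream feeds."""

    def __init__(self, name):
        self.name = name
        self.downstream = []
        self.events = []

    def publish(self, event):
        self.events.append(event)
        for feed in self.downstream:
            feed.publish(event)


def build(conf):
    """Wire feeds together from a mapping of feed name -> names it forwards to."""
    feeds = {name: ComposableFeed(name) for name in conf}
    for name, targets in conf.items():
        for target in targets:
            feeds[name].downstream.append(feeds[target])
    return feeds


# Assumed wiring: two existing feeds forward into a "user" feed.
CONF = {"statuses": ["user"], "favorites": ["user"], "user": []}
```

Adding another source to the user feed is then a one-line conf change rather than new plumbing.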
Yes, there are streams of direct messages. And social graph changes. And other interesting things.
We can’t expose them just yet due to privacy policy issues.
But, we’ll get there. Plans have been laid.
pass through all of these components, and
be presented to your stream.
If all is running well with all of the upstream systems --
tweets and other events are usually delivered with an average latency of about 160ms.
Sometimes I find it funny that outside devs say
“hey, did you know that you are throwing 503s on this endpoint”.
Yes, we know. There’s a graph for it.
If there isn’t a graph -- we immediately add one.
And we roll the key ones up into a grid of 12 summary graphs that everyone watches.
There’s also a bank of graph monitors in ops.
This was taken during peak load on a typical weekday.
You can see a blip about half way through.
Given that all clusters moved in unison, there was probably an upstream garbage collection in kestrel, or something similar.
We’ve put a lot of effort into lowering Twitter latency and keeping it low and predictable.
(If visible, blue line is a cascaded cluster, where yellow and green are fanout only.)
You get to see a lot that happens to you -- who favorited your tweet, who followed you, and so forth --
in real time.
You also get to see what your followings are doing. Who they favorited and followed.
There’s a huge opportunity for discovery here with User Streams.
If two friends favorite a tweet, and two others follow the tweeter, show me the tweet!
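That rule is easy to state in code. The event fields below ("type", "source", "tweet_id", "target") are assumed shapes for illustration, not the actual user stream format, and the thresholds are the ones from the sentence above.

```python
def should_surface(tweet_id, author, events, friends, min_favorites=2, min_follows=2):
    """True if enough friends favorited the tweet and enough *other* friends followed its author."""
    favoriters = {
        e["source"] for e in events
        if e.get("type") == "favorite"
        and e.get("tweet_id") == tweet_id
        and e["source"] in friends
    }
    new_followers = {
        e["source"] for e in events
        if e.get("type") == "follow"
        and e.get("target") == author
        and e["source"] in friends
    }
    # "Two others": followers are counted only if they did not also favorite.
    return (len(favoriters) >= min_favorites
            and len(new_followers - favoriters) >= min_follows)
```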
We know that User Streams are transformative. Goldman and I were watching #chirp during Ryan’s talk.
It was incredible to watch them scroll by.
In the few days we’ve been using them at the office, everyone has been transfixed.
Engineering productivity has plummeted!
It’s the Farmville of Twitter.
First we’re going to get user streams out there.
We have some more critical features to add and we have to
add capacity to handle potentially millions of connections.
We’ve announced the details for a user stream preview period.
Read them carefully before coding or planning anything.
There are also some interesting events that we don’t yet publish.
We’ll see what we can get out there for you.
Once user streams are in a good spot, we want to get back to some interesting large scale integration features.
Real time Twitter.
Lots of event types.
More engagement.
More discovery.
New user experiences are now possible.
Go out and build something great!