2. Effective Use of the
Twitter Search API
Eric Jensen
Twitter Search
Submit your questions via
http://bit.ly/chirpsearch
or hashtag #chirpsearch
3. Agenda
• Mission of the Twitter Search API
• History
• Most recently: ranking the top results
• What’s next
4. Search API Mission
Connect users with what's most
important and interesting to
them in the here and now
(return the best stuff for a query)
5. Search Stats
• Over 600 million queries per day
• Typically less than 200 milliseconds per query
• Typically less than 20 seconds indexing latency
• Index of hundreds of millions of tweets
7. Search vs. Streaming
• Do use the search API for your app when:
  • The user can input a query
  • You need immediate results, not tracking
• Don’t use the search API for your app when:
  • Your user experience requires comprehensive results (all the tweets, not just the best ones)
  • You only need tweets from/to/at particular users
9. Why is this OK?
[Diagram labels: requests search.json?q=twitter and search.json?since_id=9290798834&q=twitter; Timeline Cache (entry q=twitter holding tweets 1 2 3 4); Search Index; Tweets]
10. Search API History
Timeline of the Twitter Search API:
• Apr 4, 2008: Summize Launches Twitter Search
• Jul 14, 2008: Summize Acquired by Twitter
• Apr 1, 2009: Search on Twitter.com
• Nov 5, 2009: Quality Filtering on Trends
• Jan 6, 2010: Local Trends
• Apr 1, 2010: Top Results Include Popular
• Apr 15, 2010: Chirp!
11. Ranking Top Results
• Best stuff for a query
• Many factors
• First step
• Available from API
12. Top Results API
• New parameter: result_type
  • mixed: Eventually this will become the default value. Include both popular and real time results in the response.
  • recent: The current default value. Return only the most recent results in the response.
  • popular: Return only the most popular results in the response.
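As a rough illustration of the result_type parameter, the sketch below builds the query string for each of the three values listed on this slide. It only constructs strings and makes no network calls. The function name `search_query` and the constant `RESULT_TYPES` are hypothetical helpers, not part of any Twitter client, and the relative path `search.json` is taken from the request examples elsewhere in this deck.

```python
from urllib.parse import urlencode

# The three values named on the slide. "recent" is the current default.
RESULT_TYPES = ("mixed", "recent", "popular")


def search_query(q, result_type=None):
    """Build a search.json query string (hypothetical helper).

    Leaving result_type as None omits the parameter, so the
    service-side default applies.
    """
    params = {"q": q}
    if result_type is not None:
        if result_type not in RESULT_TYPES:
            raise ValueError("result_type must be one of %r" % (RESULT_TYPES,))
        params["result_type"] = result_type
    return "search.json?" + urlencode(params)
```

An app that wants to opt in to ranked results before the default changes could call `search_query("twitter", "mixed")`. An app that depends on strict recency order could pass `"recent"` explicitly so that a later change of default does not affect it.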
18. The Near Future
• Remove duplicates (retweets)
• Deeper index
• Hit highlighting in the API
• More consistency (with the REST API)
• Better rate limiting
19. The Future (cont)
• More relevance
• More metadata
• More stuff
• More operators
• places, @anywhere, annotations
20. Open Source in Search
• http://twitter.com/about/opensource
• mysql, hadoop, kestrel, twitter-text, etc.
• lucene
• commons-pipeline
• varnish
• jmeter
• nutch language identifier
• mecab
I will talk about:
- I’ll start with some of our thinking about why we have a search API and what differentiates it from the other APIs Twitter offers
- Then I’ll get into some technical implications of these differences, comparing polling on search with tracking keywords on the streaming API
- Next, I’ll talk briefly about how the search API has changed over time
- And then we’ll dig into the most recent change, where we began ranking the top results beyond recency order. I’ll show you how I’ve modified one of our own search API clients to take advantage of that change
Simple definition: the user provides a query by engaging with an API application, and we provide the best stuff (currently tweets and trends) for that query.
Obviously the “best” stuff for Twitter has a lot to do with how recent it is, so our primary focus is on the “here and now”.
Just to give you an idea of the parameters search operates under:
- As Ev told you yesterday, we are doing more than 600M queries per day, and we have seen up to 750M on a day recently
- While realtime is our main focus, our index does contain hundreds of millions of tweets, and we’ve roughly doubled its size in the last six months
- Of course, the number of tweets has grown even faster than we’ve increased that index size, so the index only covers about a week of them right now, but that is something we’re currently working on expanding
So obviously we’re operating at a large scale, but what’s really interesting to me about the search API is the variety of applications you as developers have found for it. I’ve listed just a few here to illustrate what people are currently doing with the API.
So that’s what people are doing with the search API, but the streaming API also supports tracking keywords and some location and language filtering. So, if you’re developing a new app, how do you decide which to use?
The biggest difference between the search API and the track API is how you get new results matching your standing query. On the streaming API the push model makes this obvious: new results are sent to you as they come in. Since the focus of the search API is on apps that let the user manipulate the query (whether explicitly or implicitly), registering a standing query for every request makes less sense. Instead, the search API uses a polling model with a cursor.
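The polling model with a cursor can be sketched roughly as follows. This is a minimal illustration, not a real client: it builds the two request shapes shown on the "Why is this OK?" slide (`search.json?q=twitter` and `search.json?since_id=...&q=twitter`) and advances the cursor to the highest tweet ID seen. The names `build_query`, `poll_once`, and `make_fake_fetch` are hypothetical, and the assumption that each result is a dict with `id` and `text` keys is made for the example only. The fake fetch function stands in for the network so the sketch runs offline.

```python
from urllib.parse import urlencode, parse_qs


def build_query(q, since_id=None):
    """Build the search.json query string; since_id is the polling cursor."""
    params = {}
    if since_id is not None:
        params["since_id"] = since_id
    params["q"] = q
    return "search.json?" + urlencode(params)


def poll_once(fetch, q, since_id=None):
    """Run one poll and return (new_tweets, next_since_id).

    fetch is any callable that takes the query string and returns a list
    of tweet dicts with an "id" key (an assumed shape for this sketch).
    """
    tweets = fetch(build_query(q, since_id))
    if not tweets:
        # Nothing new: keep the cursor where it was.
        return [], since_id
    newest = max(t["id"] for t in tweets)
    if since_id is not None:
        newest = max(newest, since_id)
    return tweets, newest


def make_fake_fetch(tweets):
    """Return an in-memory stand-in for the search service (test helper)."""
    def fetch(url):
        query = parse_qs(url.split("?", 1)[1])
        since_id = int(query.get("since_id", ["0"])[0])
        term = query["q"][0].lower()
        # Only tweets newer than the cursor that contain the query term.
        return [t for t in tweets
                if t["id"] > since_id and term in t["text"].lower()]
    return fetch
```

The first call omits `since_id` and returns whatever currently matches. Each later call passes the cursor returned by the previous one, so a refresh only asks for tweets newer than those the app already has.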
---
Make sure you explain this diagram by pointing at it (or at least describing it). It took me a minute to get the visual presentation.
One question that comes up frequently is why we encourage apps to use this cursor to poll and how that helps us to support refreshes more efficiently, so here’s a diagram of what happens under the covers. Much as on the streaming API, when you make any query to search we actually do register it as a standing query, but only in one of our caching layers, which we call the timeline cache.
Next I’d like to take a step back and talk briefly about the history of the search API and how our thinking about it has developed.
Twitter search and the API have been around for about two years now. We made a lot of changes early on, like supporting location search, but after that we had to shift our focus to scaling the system to support the growth in tweets and queries. It’s really just in the last six months that we’ve made enough progress with scaling and grown the search team enough to be able to focus more on relevance and figuring out what that means for Twitter search.
Our mission:
----
Under “many factors” you should note that it’s not always the popular users that show up here -- that seems to be an early misconception. Our algorithm looks to find things that are interesting from any user - things that “resonate,” to use a word that Dick talked about yesterday (good to tie it in to other things being said at Chirp).
Rather than “not final” (which seems to imply there is a “final” step when we won’t be improving this) I’d say something like “First step of a long road of relevance improvements” (implying that we’ve got lots of ideas and we’ll be delivering cool stuff for a long time).