TWITTER
SEARCH
By: Ramez Al-Fayez
TWITTER
User-generated content
- 140-character messages called Tweets
- Informal, free-form language
- Diverse topics
- Images, videos and links
- Spam
Very high volume -> information overload
“When you’ve got 5 minutes to fill, Twitter is a great way to fill 35 minutes.”
(@mattcutts)
TWITTER STATS
- 2 billion queries per day
- 230 million Tweets per day
- < 10 s indexing latency
- 50 ms average query response time
- 1 billion registered users
- 143,199 Tweets per second (peak)
WHAT TO SEARCH ON TWITTER?
- Tweets
- Images (Tweets that contain images)
- Users
- News (Tweets that contain links)
SEARCHING FOR “IPAD” ON TWITTER
More than 50 Tweets mentioning “iPad” are posted within one minute.
CUSTOMIZED IR FOR TWITTER
Features of Twitter’s IR system:
- Modularity
- Scalability
- Cost effectiveness
- Simple interface
- Incremental development
CUSTOMIZED IR FOR TWITTER
The system consists of four main parts:
- A batched data aggregation and preprocessing pipeline
- An inverted index builder
- Earlybird shards
- Earlybird roots
CRAWLING TWITTER
HoseBird API Client (hbc)

StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
// Optional: set up some followings and track terms
List<Long> followings = Lists.newArrayList(1234L, 566788L);
List<String> terms = Lists.newArrayList("twitter", "api");
endpoint.followings(followings);
endpoint.trackTerms(terms);

// "builder" is an hbc ClientBuilder configured elsewhere with the stream host,
// OAuth credentials, this endpoint and a queue-backed message processor.
Client hosebirdClient = builder.build();
INDEXING TWITTER
On November 18, 2014, Twitter announced that it now indexes every public Tweet posted since 2006.
- Temporal sharding: the Tweet corpus was first divided into multiple time tiers.
- Hash partitioning: within each time tier, data was divided into partitions based on a hash function (see the routing sketch after this list).
- Earlybird: within each hash partition, data was further divided into chunks called segments. Segments were grouped together based on how many could fit on each Earlybird machine.
- Replicas: each Earlybird machine is replicated to increase serving capacity and resilience.
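
To make the two-dimensional scheme concrete, below is a minimal, hypothetical routing sketch: a Tweet is mapped to a time tier by its timestamp and to a partition within that tier by a hash of its ID. The class, tier boundaries and partition counts are invented for illustration and are not Twitter's actual code.

import java.util.List;

// Hypothetical router: pick a time tier by Tweet timestamp, then a hash partition within it.
public final class ShardRouter {

    // A tier covers a time range and owns a fixed number of hash partitions (invented layout).
    record Tier(String name, long startMillis, long endMillis, int numPartitions) {}

    private final List<Tier> tiers;

    ShardRouter(List<Tier> tiers) {
        this.tiers = tiers;
    }

    // Returns "tierName/partition" for the given Tweet, or null if no tier covers it.
    String route(long tweetId, long createdAtMillis) {
        for (Tier tier : tiers) {
            if (createdAtMillis >= tier.startMillis() && createdAtMillis < tier.endMillis()) {
                int partition = Math.floorMod(Long.hashCode(tweetId), tier.numPartitions());
                return tier.name() + "/" + partition;
            }
        }
        return null; // outside all indexed tiers
    }

    public static void main(String[] args) {
        // Invented tier layout: two historical tiers plus a "recent" tier.
        ShardRouter router = new ShardRouter(List.of(
                new Tier("tier-2006-2010", 0L, 1_300_000_000_000L, 4),
                new Tier("tier-2010-2014", 1_300_000_000_000L, 1_420_000_000_000L, 8),
                new Tier("tier-recent", 1_420_000_000_000L, Long.MAX_VALUE, 16)));

        System.out.println(router.route(123456789L, 1_500_000_000_000L)); // prints tier-recent/5
    }
}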
DATA AGGREGATION
- Engagement aggregator: counts the number of engagements for each Tweet in a given day. These engagement counts are used later as an input for scoring each Tweet.
- Aggregation: joins multiple data sources together based on Tweet ID.
- Ingestion: performs different types of preprocessing, such as language identification, tokenization, text feature extraction and URL resolution.
- Scorer: computes a score based on the features extracted during ingestion. For the smaller historical indices, this score determined which Tweets were selected into the index.
- Partitioner: divides the data into smaller chunks through a hashing algorithm. The final output is stored in HDFS. (A sketch of these stages follows the list.)
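
As a rough illustration of the batched stages above, the sketch below (class and field names are invented, not Twitter's pipeline) joins daily engagement counts to Tweets by ID, computes a toy score from those features, and assigns each Tweet to a partition by hashing its ID.

import java.util.List;
import java.util.Map;

// Hypothetical batch step: join engagements to Tweets by ID, score, and partition.
public final class DailyAggregation {

    record Tweet(long id, String text) {}
    record ScoredTweet(Tweet tweet, double score, int partition) {}

    static final int NUM_PARTITIONS = 8;

    // Toy scorer: more engagements and longer text give a higher score.
    static double score(Tweet tweet, long engagements) {
        return Math.log1p(engagements) + 0.01 * tweet.text().length();
    }

    static List<ScoredTweet> run(List<Tweet> tweets, Map<Long, Long> engagementsByTweetId) {
        return tweets.stream()
                .map(t -> {
                    long engagements = engagementsByTweetId.getOrDefault(t.id(), 0L);
                    int partition = Math.floorMod(Long.hashCode(t.id()), NUM_PARTITIONS);
                    return new ScoredTweet(t, score(t, engagements), partition);
                })
                .toList();
    }

    public static void main(String[] args) {
        List<Tweet> tweets = List.of(new Tweet(1L, "hello twitter"), new Tweet(2L, "ipad launch day"));
        Map<Long, Long> engagements = Map.of(2L, 120L);
        run(tweets, engagements).forEach(System.out::println);
    }
}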
INVERTED INDEX
- Segment partitioner: groups multiple batches of preprocessed daily Tweet data from the same partition into bundles. We call these bundles “segments.”
- Segment indexer: inverts each Tweet in a segment, builds an inverted index and stores the inverted index in HDFS. (A minimal indexing sketch follows this list.)
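
For intuition, here is a minimal, generic inverted index sketch rather than Earlybird's actual data structures: each token of a Tweet's text is mapped to a postings list of Tweet IDs, which is what makes term lookups cheap at query time.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory inverted index: token -> postings list of Tweet IDs.
public final class TinyInvertedIndex {

    private final Map<String, List<Long>> postings = new HashMap<>();

    // Very rough tokenizer; real pipelines do language-aware tokenization.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    public void add(long tweetId, String text) {
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new ArrayList<>()).add(tweetId);
        }
    }

    public List<Long> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(101L, "New iPad announced today");
        index.add(102L, "I love my iPad");
        System.out.println(index.lookup("ipad")); // [101, 102]
    }
}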
SEARCH PROCESS
Earlybird shards:
- The inverted index builders produced hundreds of inverted index segments. These segments were then distributed to machines called Earlybirds. Since each Earlybird machine could only serve a small portion of the full Tweet corpus, we had to introduce sharding.
- A two-dimensional sharding scheme distributes index segments onto the serving Earlybirds:
  - Multiple time tiers
  - Hash partitioning within each tier
- Each Earlybird machine is replicated to increase serving capacity and resilience.
Earlybird roots:
- The roots perform a two-level scatter-gather, as shown in the diagram below, merging search results and term-statistics histograms. (A merging sketch follows the diagram.)
(Diagram: Earlybird roots performing a two-level scatter-gather over the time tiers and hash partitions, merging results on the way back up.)
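
Below is a minimal, hypothetical scatter-gather sketch (names, timeout and merge policy are invented, and only a single level is shown rather than the two levels the roots actually use): the root fans a query out to shards in parallel, then merges the per-shard hit lists by score.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical root node: scatter a query to shards, gather and merge ranked hits.
public final class ScatterGatherRoot {

    record Hit(long tweetId, double score) {}

    interface Shard {
        List<Hit> search(String query, int limit);
    }

    private final List<Shard> shards;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    ScatterGatherRoot(List<Shard> shards) {
        this.shards = shards;
    }

    List<Hit> search(String query, int limit) throws Exception {
        // Scatter: one search task per shard.
        List<Future<List<Hit>>> futures = new ArrayList<>();
        for (Shard shard : shards) {
            futures.add(pool.submit(() -> shard.search(query, limit)));
        }
        // Gather: collect per-shard results and merge by score.
        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> future : futures) {
            merged.addAll(future.get(100, TimeUnit.MILLISECONDS)); // per-shard timeout
        }
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged.subList(0, Math.min(limit, merged.size()));
    }

    public static void main(String[] args) throws Exception {
        Shard a = (q, k) -> List.of(new Hit(1L, 0.9), new Hit(2L, 0.4));
        Shard b = (q, k) -> List.of(new Hit(3L, 0.7));
        ScatterGatherRoot root = new ScatterGatherRoot(List.of(a, b));
        System.out.println(root.search("ipad", 2)); // hits 1 (0.9) and 3 (0.7)
        root.pool.shutdown();
    }
}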
RANKING
- Different types of content are searched separately.
- Uniscores: used as a means to blend different content types into one search result list.
- Score unification: each individual piece of content is assigned a “raw” score, which is then converted into a uniscore (see the sketch after this list).
- Burst: used to filter out content types with low or no burst. It is also used to boost the score of the corresponding content types, as a feature for a multi-class classifier that predicts the most likely content type for a query, and in additional components of the ranking system.
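
The slides do not spell out how raw scores become uniscores, so the following is only a hypothetical sketch of the idea: raw scores from each content type, which live on different scales, are mapped onto a shared 0-1 scale (here with invented min-max calibration ranges) so that different content types can be compared and blended.

import java.util.Map;

// Hypothetical score unification: map per-type raw scores onto a common 0..1 scale.
public final class Uniscore {

    // Assumed per-type calibration ranges (invented numbers for illustration).
    static final Map<String, double[]> RAW_RANGE = Map.of(
            "tweet", new double[]{0.0, 20.0},
            "news",  new double[]{0.0, 5.0},
            "user",  new double[]{0.0, 100.0});

    static double toUniscore(String contentType, double rawScore) {
        double[] range = RAW_RANGE.get(contentType);
        double normalized = (rawScore - range[0]) / (range[1] - range[0]);
        return Math.max(0.0, Math.min(1.0, normalized)); // clamp to [0, 1]
    }

    public static void main(String[] args) {
        // Raw scores on very different scales become directly comparable.
        System.out.println(toUniscore("tweet", 13.0)); // 0.65
        System.out.println(toUniscore("news", 3.25));  // 0.65
    }
}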
RANKING
Example: the search ranker has chosen News1 followed by Tweet1 so far and is presented with three candidates, Tweet2, User Group and News2, to pick the content placed after Tweet1.
News2 has the highest uniscore, but the ranker picks Tweet2 instead of News2 because a change in type between consecutive items is penalized, for instance by decreasing the score of News2 from 0.65 to 0.55.
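
A minimal, hypothetical version of this greedy blending step, with the penalty value and Tweet2's uniscore invented to mirror the example above: each candidate's uniscore is reduced by a fixed penalty when its content type differs from the previously placed item, and the highest adjusted score wins.

import java.util.Comparator;
import java.util.List;

// Hypothetical greedy blender with a type-change penalty, mirroring the slide's example.
public final class TypeAwareBlender {

    record Candidate(String name, String type, double uniscore) {}

    static final double TYPE_CHANGE_PENALTY = 0.10; // invented value

    // Pick the next item given the type of the previously placed item.
    static Candidate pickNext(List<Candidate> candidates, String previousType) {
        return candidates.stream()
                .max(Comparator.comparingDouble((Candidate c) ->
                        c.type().equals(previousType) ? c.uniscore()
                                                      : c.uniscore() - TYPE_CHANGE_PENALTY))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Previously placed: News1, then Tweet1. Candidates for the next slot:
        List<Candidate> candidates = List.of(
                new Candidate("Tweet2", "tweet", 0.60),   // Tweet2's uniscore is assumed
                new Candidate("UserGroup", "user", 0.50),
                new Candidate("News2", "news", 0.65));

        // News2 drops from 0.65 to 0.55 (type change after Tweet1), so Tweet2 at 0.60 wins.
        System.out.println(pickNext(candidates, "tweet")); // Candidate[name=Tweet2, ...]
    }
}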
RANKING
Burst detection example: for the query “Photo”, three sequences of Tweet counts are shown over eight 15-minute buckets, from bucket 1 (2 hours ago) to bucket 8 (most recent).
Normalized image and news counts are matched to one of n = 5 states: one average, two above and two below. The matched-state curves give a more stable quantization of the original sequence, which removes small noisy peaks.
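
As a rough illustration of that quantization, the sketch below (state levels are invented) snaps each bucket's normalized count to the nearest of five states, one around the average, two above and two below, so small fluctuations collapse into a stable level while sustained bursts survive.

import java.util.Arrays;

// Hypothetical 5-state quantizer for a normalized count sequence (mean around 1.0).
public final class BurstQuantizer {

    // Invented state levels: two below average, average, two above average.
    static final double[] STATE_LEVELS = {0.25, 0.5, 1.0, 2.0, 4.0};

    // Snap each normalized count to the nearest state level.
    static double[] quantize(double[] normalizedCounts) {
        double[] states = new double[normalizedCounts.length];
        for (int i = 0; i < normalizedCounts.length; i++) {
            double best = STATE_LEVELS[0];
            for (double level : STATE_LEVELS) {
                if (Math.abs(level - normalizedCounts[i]) < Math.abs(best - normalizedCounts[i])) {
                    best = level;
                }
            }
            states[i] = best;
        }
        return states;
    }

    public static void main(String[] args) {
        // Eight 15-minute buckets of normalized Tweet counts; a small noisy peak at bucket 3.
        double[] counts = {0.9, 1.1, 1.4, 1.0, 0.95, 2.3, 3.8, 4.2};
        System.out.println(Arrays.toString(quantize(counts)));
        // -> [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 4.0, 4.0]: the peak at 1.4 is smoothed away,
        //    while the sustained rise at the end is kept as a burst.
    }
}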
REFERENCES
- Anirudh Todi, TSAR, a TimeSeries AggregatoR, https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
- Youngin Shin, New Twitter search results, https://blog.twitter.com/2013/new-twitter-search-results
- Yi Zhuang, Building a complete Tweet index, https://blog.twitter.com/2014/building-a-complete-tweet-index
- J. Kleinberg, Bursty and Hierarchical Structure in Streams, Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002.
- Brendan O'Connor, Michel Krieger and David Ahn, TweetMotif: Exploratory Search and Topic Summarization for Twitter, Proc. of ICWSM, 2010.
THANK YOU!
