Search at Tumblr (nyc search meetup)
Yufei Pan, Director of Search at Tumblr, presenting at NYC Search & Analytics meetup in January 2014.

http://www.meetup.com/NYC-Search-and-Discovery/

  • Really nice. Lots of hard-won lessons there. I especially like the emphasis on SOLR, Redis + SQL (although we use PostgreSQL instead).

    Validated what we've been doing at Gnowit on our de-duplication + topic interest modelling + search infrastructure as well. Lots of parallels with the logic of our filed patents as well.

    Thank you for sharing this Yufei.
Search at Tumblr (nyc search meetup) Presentation Transcript

  • 1. Search at Tumblr
    Yufei Pan, Director of Search, Tumblr
    16 January 2014
  • 2. Tumblr - Follow the World’s Creators
    Founded
      ● David Karp
      ● February 2007
    Publishing Platform
      ● 163 million blogs
      ● 72 billion posts
    Social Network
      ● Follow, Mention
      ● Like, Reblog
  • 3. About search@tumblr
    ● Most important way to discover great content
      ○ 50M searches a day
    ● Limited search for a long time (2007-2012)
      ○ Tagged page
        ■ MySQL lookup of a single tag id
        ■ sorted in reverse chronological order
      ○ Finding a blog
        ■ navigate through curated directories
  • 4. About search@tumblr
    ● Search Team
      ○ July 2012: Jak joined as first search engineer!
      ○ Jak, Yufei, Bennett, Beitao, Patrick, Adam
    ● Features launched in 2013
      ○ Post search, Blog search, Theme search
      ○ Typeaheads, Recommendation, Trends
  • 5. Whole New Search
    Post search
      ● full text search
      ● top and recent
      ● post type filtering
    Blog search
      ● name & title
      ● top tags in posts
      ● blog highlights
    Related search
      ● term co-occurrence
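The slide names term co-occurrence as the basis for related search but gives no detail. The following is a minimal sketch of one way that could work, assuming tags on the same post count as co-occurring; the function names `build_cooccurrence` and `related_terms` and the sample data are hypothetical, not Tumblr's implementation.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(posts):
    """Count how often each unordered pair of tags appears on the same post."""
    pair_counts = Counter()
    for tags in posts:
        unique = sorted(set(tags))
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def related_terms(term, pair_counts, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

# Hypothetical sample data: each inner list is the tag list of one post.
posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "photography"],
    ["dogs", "gif"],
]
pairs = build_cooccurrence(posts)
print(related_terms("cats", pairs))
```

In a production setting the pair counts would likely be computed offline over the full corpus, which fits the precomputation theme later in the talk.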
  • 6. Typeahead Autocompletes
    Search Autocomplete, Mention Autocomplete, Tag Suggest
      ● Interactive guide of tumblr content
      ● High volume of traffic
      ● Low latency
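To make the low-latency requirement concrete, here is a small sketch of prefix lookup over a precomputed, sorted term list using binary search. The class name `PrefixIndex`, the popularity weights, and the in-memory layout are assumptions for illustration; the talk does not describe how the typeahead indices are actually stored.

```python
import bisect

class PrefixIndex:
    """Sorted-array typeahead: two binary searches bound the matching range."""

    def __init__(self, weighted_terms):
        # weighted_terms: dict mapping term -> popularity (hypothetical signal)
        self._terms = sorted(weighted_terms)
        self._weights = dict(weighted_terms)

    def suggest(self, prefix, k=5):
        lo = bisect.bisect_left(self._terms, prefix)
        # "\uffff" sorts after any BMP character, so this bounds the prefix range.
        # Terms whose next character is outside the BMP (e.g. emoji) are missed.
        hi = bisect.bisect_left(self._terms, prefix + "\uffff")
        candidates = self._terms[lo:hi]
        candidates.sort(key=lambda t: (-self._weights[t], t))
        return candidates[:k]

index = PrefixIndex({"cat": 10, "cats": 50, "catalog": 5, "dog": 40})
print(index.suggest("cat"))
```

Sorting the matched range at query time is fine for short ranges; for very common prefixes, storing the top suggestions per prefix ahead of time would avoid the sort.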
  • 7. Recommendations
    ● Personalized Recommendation
    ● Weekly Dashboard Digest
  • 8. Trends
    ● Trending Tags
    ● Trending Blogs
  • 9. Theme Search
  • 10. Search Architecture (diagram; labels listed by layer as far as they can be recovered)
    ● Search features: Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend, Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts
    ● Online: Search Online Framework
    ● Data: Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index, Follower Counts, Post Notecount, Post Model, Personalized Blog Index, Trending Blogs, Trending Posts, Trending Tags, Related Tag Index, Blog Global Rank, Blog Model, User Model, Typeahead Indices, Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree Like, Root Blog Feedback, In-Blog Tag Index, Global Tag Index
    ● Offline: Search Offline Framework
    ● Stores and inputs: Rediscover, Solr, MySQL, Activity Streams (Fire Geyser), Scribe logs, Sqoop tables (HDFS)
    ● Platform: Nginx, Linux
  • 11. Software Stack
    ● Search Online
      ○ HAProxy, Nginx, PHP
      ○ Memcache
      ○ Icinga, Scribe, OpenTSDB
    ● Search Data
      ○ Solr, Redis, MySQL
    ● Search Offline
      ○ Sqoop, Hadoop
      ○ Java, Hive, Pig, Scalding, Python
  • 12. Search Online Framework
    ● Search Services
    ● SearchBase
      ○ Search Flow Execution
      ○ Multi-level Caching
      ○ Search Logging
      ○ Async Execution
      ○ Search Editorial
    ● Interfaces and implementations
      ○ QueryIF: SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
      ○ RetrieverIF: SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
      ○ SignalFetcherIF: NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
      ○ RankerIF: TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
      ○ DocFetcherIF: PostFetcher, TumblelogFetcher, TagFetcher
      ○ FilterIF: PostFilter, TumblelogFilter, TagFilter
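The interface names on this slide suggest a staged search flow: retrieve candidates, fetch signals, rank, fetch documents, filter. The sketch below illustrates that shape only. The stage order is an assumption read off the slide's left-to-right layout, and `Candidate`, `run_search_flow`, and the toy data are hypothetical; the real framework is written in PHP and is not shown in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Candidate:
    doc_id: int
    signals: Dict[str, float] = field(default_factory=dict)
    score: float = 0.0
    doc: dict = field(default_factory=dict)

def run_search_flow(query, retriever, signal_fetchers, ranker, doc_fetcher, doc_filter):
    """Run one query through retrieve -> signals -> rank -> fetch -> filter."""
    candidates = [Candidate(doc_id) for doc_id in retriever(query)]
    for fetch in signal_fetchers:
        for c in candidates:
            c.signals.update(fetch(c.doc_id))
    for c in candidates:
        c.score = ranker(query, c)
    candidates.sort(key=lambda c: c.score, reverse=True)
    for c in candidates:
        c.doc = doc_fetcher(c.doc_id)
    return [c for c in candidates if doc_filter(c)]

# Toy stand-ins for an index, a signal store, and a document store.
INDEX = {"cats": [1, 2, 3]}
NOTECOUNTS = {1: 5, 2: 40, 3: 12}
DOCS = {
    1: {"title": "first", "blocked": False},
    2: {"title": "second", "blocked": True},
    3: {"title": "third", "blocked": False},
}

results = run_search_flow(
    "cats",
    retriever=lambda q: INDEX.get(q, []),
    signal_fetchers=[lambda doc_id: {"notecount": NOTECOUNTS[doc_id]}],
    ranker=lambda q, c: c.signals["notecount"],
    doc_fetcher=lambda doc_id: DOCS[doc_id],
    doc_filter=lambda c: not c.doc["blocked"],
)
print([c.doc_id for c in results])
```

Keeping each stage behind its own callable is what lets a framework like this mix and match, for example pairing a Solr-backed retriever with different rankers for post search and blog search.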
  • 13. Search Batch Processing
    ● Search Data (Redis)
    ● Search Workflow Engine
      ○ Workflow Composition
      ○ Dependency Resolution
      ○ Automatic Versioning
      ○ Data Verification
      ○ Execution Logging
      ○ Failure Detection/Alert
    ● Search Task Base
      ○ Hive Jobs, Pig Jobs, Scalding Jobs, Streaming Jobs
      ○ Term Generators, Top-K Indexer, Delta Propagator
      ○ Lucille2 Classes
    ● Scribe Logs, Sqoop Tables (HDFS)
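Dependency resolution, execution logging, and failure detection are listed as workflow engine features without further detail. As a rough illustration, the sketch below orders tasks with a topological sort and skips anything downstream of a failure. It uses Python's standard `graphlib` (3.9+); `run_workflow` and the task names are hypothetical and unrelated to Tumblr's actual engine.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps):
    """tasks: name -> callable; deps: name -> iterable of prerequisite names.

    Returns an execution log of (task name, status) in run order.
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    log, failed = [], set()
    for name in order:
        if graph.get(name, set()) & failed:
            # A prerequisite failed or was skipped, so skip this task too.
            failed.add(name)
            log.append((name, "skipped"))
            continue
        try:
            tasks[name]()
            log.append((name, "ok"))
        except Exception as exc:
            failed.add(name)
            log.append((name, f"failed: {exc}"))
    return log

def broken_step():
    raise RuntimeError("boom")

log = run_workflow(
    tasks={
        "import_logs": lambda: None,
        "count_terms": broken_step,
        "build_topk": lambda: None,
    },
    deps={"count_terms": ["import_logs"], "build_topk": ["count_terms"]},
)
print(log)
```

`TopologicalSorter` raises `CycleError` if the dependencies are circular, which is a reasonable place to hook an alert.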
  • 14. Indexing
    ● 3-Tier indices
      ○ Index all posts
        ■ 600+ machines
      ○ Recent (6 weeks) + Popular (4 years) + Existing tag table
        ■ Down to 40 machines
        ■ Minor loss in coverage
        ■ Serve up to 4K qps (non-cached)
    ● Lean index
      ○ Separate signals from index
        ■ Eliminate high volume re-indexing
        ■ Independent signal engineering from indexing
      ○ Separate document text from index
        ■ Dropping the memory footprint
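The tiering on this slide can be read as a routing rule: a post goes into the recent index if it is under six weeks old, into the popular index if it is under four years old and popular enough, and is otherwise reachable only through the existing tag table. The sketch below encodes that reading. The popularity threshold `POPULAR_MIN_NOTES`, the use of note count as the popularity measure, and the assumption that a post may sit in both tiers are illustrative guesses, not figures from the talk.

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(weeks=6)         # "Recent (6W)" from the slide
POPULAR_WINDOW = timedelta(days=4 * 365)   # "Popular (4Y)" from the slide
POPULAR_MIN_NOTES = 100                    # hypothetical popularity cutoff

def index_tiers(post_time, note_count, now):
    """Return the index tiers a post belongs to; empty means tag table only."""
    age = now - post_time
    tiers = []
    if age <= RECENT_WINDOW:
        tiers.append("recent")
    if age <= POPULAR_WINDOW and note_count >= POPULAR_MIN_NOTES:
        tiers.append("popular")
    return tiers

now = datetime(2014, 1, 16)
print(index_tiers(datetime(2014, 1, 1), 3, now))
print(index_tiers(datetime(2012, 1, 1), 500, now))
print(index_tiers(datetime(2012, 1, 1), 3, now))
```

Because popularity changes over time, a rule like this only works cheaply if the popularity signal lives outside the index, which is the point of the "lean index" bullets.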
  • 15. Ranking
    ● Quickly evolving!
    ● Major ranking signals in production
      ○ Global popularity
        ■ likes, reblogs, follows
      ○ Local popularity
        ■ popularity projected on <user, query>
          ● blog search: aggregated likes on query term
          ● blog recommendation: follow counts among friends
      ○ Textual relevancy
        ■ how: exact match, query proximity
        ■ where: name, title, tag, mention, body, etc
      ○ Recency
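The slide lists the signal families but not how they are combined. Purely as an illustration, the sketch below blends them with a weighted sum, log-damped popularity counts, and exponential recency decay. Every weight, the half-life, and the signal field names (`notes`, `likes_on_query`, `text_match`, `age_days`) are hypothetical.

```python
import math

# Hypothetical weights; the talk does not say how signals are combined.
WEIGHTS = {"global": 0.3, "local": 0.3, "text": 0.3, "recency": 0.1}

def recency(age_days, half_life_days=7.0):
    """Exponential decay: 1.0 for a brand-new post, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)

def score(signals, weights=WEIGHTS):
    """Weighted sum of global popularity, local popularity, text match, recency."""
    return (
        weights["global"] * math.log1p(signals["notes"])
        + weights["local"] * math.log1p(signals["likes_on_query"])
        + weights["text"] * signals["text_match"]
        + weights["recency"] * recency(signals["age_days"])
    )

candidates = {
    "old_hit": {"notes": 1000, "likes_on_query": 10, "text_match": 1.0, "age_days": 60},
    "fresh_hit": {"notes": 1000, "likes_on_query": 10, "text_match": 1.0, "age_days": 1},
    "weak_match": {"notes": 10, "likes_on_query": 0, "text_match": 0.2, "age_days": 1},
}
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
print(ranked)
```

A hand-tuned linear blend like this is the usual starting point before moving to learning to rank, which the talk lists under future work.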
  • 16. Duplicate Elimination (DE)
    ● Index-time DE
      ○ post signature
        ■ number of tags > N1
        ■ md5 hash of normalized tag list
    ● Search-time DE
      ○ Media DE
        ■ posts with same media hashes
      ○ Near DE
        ■ posts with tags > N2
        ■ mark as near duplicate if diff <= N3 tags
        ■ older posts selected as original
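The two tag-based rules above can be sketched as follows. The values of N1, N2, and N3 are not given in the talk, so the ones here are placeholders; the normalization (trim, lowercase, dedupe, sort) and the use of symmetric set difference as the "diff" are also assumptions.

```python
import hashlib

N1, N2, N3 = 3, 3, 1  # hypothetical thresholds; the slide leaves them unspecified

def normalize_tags(tags):
    """Assumed normalization: trim, lowercase, drop empties, dedupe, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})

def post_signature(tags, n1=N1):
    """Index-time DE: md5 of the normalized tag list, only when tags > N1."""
    norm = normalize_tags(tags)
    if len(norm) <= n1:
        return None
    return hashlib.md5(",".join(norm).encode("utf-8")).hexdigest()

def near_duplicates(posts, n2=N2, n3=N3):
    """Search-time near DE over (post_id, timestamp, tags) tuples.

    Returns {duplicate_id: original_id}; the oldest post in a group is the original.
    """
    eligible = sorted(
        ((pid, ts, set(normalize_tags(tags))) for pid, ts, tags in posts),
        key=lambda p: p[1],
    )
    eligible = [p for p in eligible if len(p[2]) > n2]
    originals, dupes = [], {}
    for pid, ts, tags in eligible:
        match = next((o for o in originals if len(tags ^ o[2]) <= n3), None)
        if match is not None:
            dupes[pid] = match[0]
        else:
            originals.append((pid, ts, tags))
    return dupes

posts = [
    (1, 100, ["a", "b", "c", "d"]),
    (2, 200, ["A", "b", "c", "d", "e"]),
    (3, 300, ["x", "y"]),
    (4, 50, ["q", "r", "s", "t"]),
]
print(near_duplicates(posts))
```

The tag-count thresholds matter because posts with only one or two tags would otherwise collide constantly; requiring a longer tag list makes an identical or near-identical list a much stronger duplicate signal.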
  • 17. Search Platform
    ● A curvy road
      ○ Started with ElasticSearch
      ○ Switched to SolrCloud due to reliability
      ○ Ended up with Solr + Customized Clustering
    ● Our take
      ○ ElasticSearch and SolrCloud have great functionality
        ■ distributed indexing and search
        ■ easy cluster management
      ○ Solr still seems much more reliable under high indexing load and search traffic
  • 18. Offline Precomputation
    ● Benefits
      ○ Minimize the search online latency
      ○ More sophisticated/expensive computation
    ● Limitation
      ○ Loss of freshness
      ○ Expensive for longtail query and results
    ● Precomputed
      ○ Typeaheads
      ○ Related search
      ○ Blog recommendation
      ○ Top posts of Blog / User
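As an example of the trade-off described here, the sketch below precomputes the top posts per blog in a batch step so that the online path is a single key lookup. A plain dict stands in for the key-value store; `precompute_top_posts`, the tuple layout, and the use of note count as the ranking key are assumptions.

```python
import heapq
from collections import defaultdict

def precompute_top_posts(posts, k=2):
    """Batch step: map each blog to its k post ids with the highest note counts.

    posts: iterable of (blog, post_id, note_count) tuples.
    """
    by_blog = defaultdict(list)
    for blog, post_id, notes in posts:
        by_blog[blog].append((notes, post_id))
    return {
        blog: [pid for _, pid in heapq.nlargest(k, items)]
        for blog, items in by_blog.items()
    }

posts = [("a", 1, 5), ("a", 2, 50), ("a", 3, 20), ("b", 4, 1)]
top_posts = precompute_top_posts(posts)

# Online step: a single lookup, with no ranking work at request time.
print(top_posts.get("a", []))
```

The result is only as fresh as the last batch run, which is the "loss of freshness" limitation on the slide, and storing an entry for every blog regardless of traffic is where the long-tail cost comes from.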
  • 19. What’s Next
    ● Inblog search
      ○ full text search on all posts in a blog
      ○ original posts, reblogs, likes
    ● Ranking
      ○ more effective and spam-resilient signals
      ○ learning to rank
    ● Topical interest modeling
      ○ supervised and unsupervised
      ○ blog content and user activities
      ○ interest based blog recommendation
    ● Content discovery
      ○ trending content in various categories
  • 20. Q&A
    Question: Are you hiring?
    Answer: Yeah! Check it out at http://www.tumblr.com/jobs
    More questions please :-)