Realtime Search at Twitter - Michael Busch


See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

At Twitter we serve more than 1.5 billion queries per day from Lucene indexes, while appending more than 200 million tweets per day in realtime. Additionally, we recently launched image, video and relevance search on the same engine.
This talk explains the changes we made to Lucene to support this high load, and the improvements we have made over the last year.



  1. Tweets per day
  2. Queries per day
  3. Indexing latency
  4. Avg. query response time
  5. Earlybird - Realtime Search @twitter. Michael Busch (@michibusch), michael@twitter.com, buschmi@apache.org
  6. Earlybird - Realtime Search @twitter. Agenda: ‣ Introduction - Search Architecture - Inverted Index 101 - Memory Model & Concurrency - Top Tweets
  7. Introduction
  8. Introduction • Twitter acquired Summize in 2008 • 1st gen search engine based on MySQL
  9. Introduction • Next gen search engine based on Lucene • Improves scalability and performance by orders of magnitude • Open Source
  10. Realtime Search @twitter. Agenda: - Introduction ‣ Search Architecture - Inverted Index 101 - Memory Model & Concurrency - Top Tweets
  11. Search Architecture
  12. Search Architecture — Tweets → Ingester. • Ingester pre-processes Tweets for search • Geo-coding, URL expansion, tokenization, etc.
  13. Search Architecture — Tweets → Ingester → (Thrift) → MySQL Master → MySQL Slaves. • Tweets are serialized to MySQL in Thrift format
  14. Earlybird — Tweets → Ingester → (Thrift) → MySQL Master → MySQL Slaves → Earlybird Index. • Earlybird reads from MySQL slaves • Builds an in-memory inverted index in real time
  15. Blender — (Thrift) → Blender → (Thrift) → Earlybird Index. • Blender is our Thrift service aggregator • Queries multiple Earlybirds, merges results
  16. Realtime Search @twitter. Agenda: - Introduction - Search Architecture ‣ Inverted Index 101 - Memory Model & Concurrency - Top Tweets
  17. Inverted Index 101
  18. Inverted Index 101 — table with 6 documents:
      1 The old night keeper keeps the keep in the town
      2 In the big old house in the big old gown.
      3 The house in the town had the big old keep
      4 Where the old night keeper never did sleep.
      5 The night keeper keeps the keep in the night
      6 And keeps in the dark and sleeps in the light.
      Example from: Justin Zobel, Alistair Moffat, "Inverted files for text search engines", ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006.
  19. Inverted Index 101 — dictionary and posting lists for the 6 documents:
      term    freq  postings
      and     1     <6>
      big     2     <2> <3>
      dark    1     <6>
      did     1     <4>
      gown    1     <2>
      had     1     <3>
      house   2     <2> <3>
      in      5     <1> <2> <3> <5> <6>
      keep    3     <1> <3> <5>
      keeper  3     <1> <4> <5>
      keeps   3     <1> <5> <6>
      light   1     <6>
      never   1     <4>
      night   3     <1> <4> <5>
      old     4     <1> <2> <3> <4>
      sleep   1     <4>
      sleeps  1     <6>
      the     6     <1> <2> <3> <4> <5> <6>
      town    2     <1> <3>
      where   1     <4>
  20.-21. Inverted Index 101 — Query: keeper. The dictionary lookup for "keeper" finds freq 3 and the posting list <1> <4> <5>.
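To make the lookup concrete, here is a minimal sketch of an in-memory inverted index answering the "keeper" query above. All names are mine; this illustrates the data structure, not Earlybird or Lucene code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory inverted index: a dictionary mapping each term to the
// list of document IDs it appears in (its posting list).
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document only once per term (a real index would
            // also track within-document frequency and positions).
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "The old night keeper keeps the keep in the town");
        idx.add(2, "In the big old house in the big old gown");
        idx.add(3, "The house in the town had the big old keep");
        idx.add(4, "Where the old night keeper never did sleep");
        idx.add(5, "The night keeper keeps the keep in the night");
        idx.add(6, "And keeps in the dark and sleeps in the light");
        System.out.println(idx.search("keeper")); // prints [1, 4, 5]
    }
}
```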
  22. Posting list encoding — Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
  23. Posting list encoding — Delta encoding: 5 10 8985 2 90998 90
  24. Posting list encoding — VInt compression: 00000101. Values 0 <= delta <= 127 need one byte.
  25. Posting list encoding — VInt compression: 11000110 00011001. Values 128 <= delta <= 16383 need two bytes.
  26. Posting list encoding — The first bit of each byte indicates whether the next byte belongs to the same value.
  27. Posting list encoding — Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically.
  28. Posting list encoding — Each posting depends on the previous one; decoding is only possible in old-to-new read direction. With recency ranking (new-to-old) no early termination is possible.
  29. Posting list encoding — By default Lucene uses a combination of delta encoding and VInt compression. VInts are expensive to decode. Problem 1: How to traverse posting lists backwards? Problem 2: How to write a posting atomically?
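As an illustration of the scheme the slides describe, here is a minimal delta + VInt sketch. It follows Lucene's low-bits-first VInt convention, so the exact byte patterns differ from the slide's drawing, but the properties are the same: variable length, and decoding must run old-to-new.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Sketch of delta + VInt encoding: 7 payload bits per byte; the high bit
// signals that more bytes follow.
public class VIntDemo {
    static byte[] encodeDeltas(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : docIds) {
            int delta = docId - prev;  // deltas are smaller than absolute IDs
            prev = docId;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // low 7 bits + continuation bit
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0, shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;  // must decode old-to-new: no random access
            docIds[i] = prev;
        }
        return docIds;
    }

    public static void main(String[] args) {
        byte[] enc = encodeDeltas(new int[] {5, 15, 9000, 9002, 100000, 100090});
        System.out.println(Arrays.toString(decode(enc, 6)));
    }
}
```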
  30. Posting list encoding in Earlybird — a posting is one int (32 bits): docID (24 bits, max. 16.7M) + textPosition (8 bits, max. 255). Tweet text can only have 140 chars. Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest a 5x improvement compared to vanilla Lucene with FSDirectory).
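The packed layout is plain bit arithmetic; a sketch, assuming docID occupies the upper 24 bits (the class and method names are mine):

```java
// Sketch: one posting packed into an int, per the layout above.
public final class Posting {
    static int encode(int docId, int textPosition) {
        assert docId >= 0 && docId <= 0xFFFFFF
            && textPosition >= 0 && textPosition <= 0xFF;
        return (docId << 8) | textPosition;
    }

    static int docId(int posting)        { return posting >>> 8; }
    static int textPosition(int posting) { return posting & 0xFF; }
}
```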
  31. Posting list encoding in Earlybird — Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090. Earlybird encoding: 5 15 9000 9002 100000 100090 (absolute values; read direction is new-to-old).
  32. Early query termination — e.g. if 3 results are requested, we can terminate after reading the 3 newest postings.
  33. Posting list encoding - Summary • ints can be written atomically in Java • Backwards traversal is easy on absolute docIDs (not deltas) • Every posting is a possible entry point for a searcher • Skipping can be done via binary search without additional data structures, even though there are better approaches which should be explored (see the sketch below) • On tweet indexes we need about 30% more storage for docIDs compared to delta + VInt; this is compensated by compression of complete segments • Max. segment size: 2^24 = 16.7M tweets
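Because postings store absolute docIDs, any posting can serve as an entry point, and finding the newest posting at or below some docID is an ordinary binary search over the int[]. A sketch, reusing the hypothetical Posting helper from above:

```java
// Sketch: binary search for the newest posting with docId <= maxDocId.
// Returns the array index of that posting, or -1 if none exists; the
// searcher then iterates backwards (new-to-old) from there.
static int entryPoint(int[] postings, int count, int maxDocId) {
    int lo = 0, hi = count - 1, found = -1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (Posting.docId(postings[mid]) <= maxDocId) {
            found = mid;   // candidate; a newer one may exist to the right
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return found;
}
```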
  34. Realtime Search @twitter. Agenda: - Introduction - Search Architecture - Inverted Index 101 ‣ Memory Model & Concurrency - Top Tweets
  35. Memory Model & Concurrency
  36.-37. Inverted index components — Dictionary and posting list storage; how do we implement them?
  38. Inverted Index — Per term we store different kinds of metadata: text pointer, frequency, postings pointer, etc. (shown on the dictionary and posting lists from slide 19).
  39.-42. Term dictionary — parallel int[] arrays indexed by termID: textPointer[], postingsPointer[], frequency[]; the term characters themselves are stored in a shared text pool. Example: adding the terms "cat", "foo", "bar" appends their characters to the pool and fills one row (t0/p0/f0, t1/p1/f1, t2/p2/f2) in the parallel arrays.
  43. Term dictionary • Number of objects << number of terms • O(1) lookups • Easy to store more term metadata by adding additional parallel arrays (see the sketch below)
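A sketch of the parallel-array idea; all names are mine, and the hash table that makes lookups O(1) is only mentioned in a comment, not implemented:

```java
import java.util.Arrays;

// Sketch of a parallel-array term dictionary (illustration, not Earlybird
// code). One row per termID across several int[] arrays; term characters
// live in a shared text pool, so the Java object count stays constant
// regardless of the number of terms.
public class TermDictionary {
    private int[] textPointer = new int[16];     // start of term text in pool
    private int[] postingsPointer = new int[16]; // most recent posting for term
    private int[] frequency = new int[16];
    private char[] textPool = new char[1 << 10];
    private int numTerms = 0, poolUsed = 0;

    // Adds a new term and returns its termID. (The real dictionary first
    // does an O(1) hash lookup to reuse the termID of an existing term.)
    public int add(String term) {
        if (numTerms == textPointer.length) {    // grow the parallel arrays
            int n = numTerms * 2;
            textPointer = Arrays.copyOf(textPointer, n);
            postingsPointer = Arrays.copyOf(postingsPointer, n);
            frequency = Arrays.copyOf(frequency, n);
        }
        while (poolUsed + term.length() > textPool.length) {
            textPool = Arrays.copyOf(textPool, textPool.length * 2);
        }
        textPointer[numTerms] = poolUsed;
        term.getChars(0, term.length(), textPool, poolUsed);
        poolUsed += term.length();
        return numTerms++;
    }
}
```

Adding another per-term attribute is just one more parallel int[], which is what makes this layout easy to extend.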
  44.-45. Inverted index components — Dictionary: parallel arrays, holding a pointer to the most recently indexed posting for each term. Posting list storage: ?
  46. Posting lists storage - Objectives • Store many singly-linked lists of different lengths space-efficiently • The number of Java objects should be independent of the number of lists or of the number of items in the lists • Every item should be a possible entry point into the lists for iterators, i.e. items must not depend on other items (e.g. no delta encoding) • Append and read must be possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads) • Traversal in backwards order
  47. Memory management — 4 int[] pools, each composed of 32K int[] blocks.
  48. Memory management — Each pool can be grown individually by adding 32K blocks.
  49. Memory management — For simplicity we can forget about the blocks for now and think of the pools as contiguous, unbounded int[] arrays. Small total number of Java objects (each 32K block is one object).
  50. Memory management — Slices can be allocated in each pool; each pool has a different, but fixed slice size: 2^11, 2^7, 2^4 and 2^1 ints (see the allocator sketch below).
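A sketch of the pool/slice allocation idea, assuming four pools of 32K int[] blocks and the slice sizes above. For brevity this sketch grows a pool by copying; the real structure would link 32K blocks instead:

```java
import java.util.Arrays;

// Sketch of slice allocation across four pools (illustration only).
// Pool i hands out fixed-size slices of SLICE_SIZES[i] ints; a list starts
// in the smallest pool and graduates to larger pools as it grows.
public class SlicePools {
    static final int[] SLICE_SIZES = {1 << 1, 1 << 4, 1 << 7, 1 << 11};
    private final int[][] pools = new int[4][];
    private final int[] used = new int[4]; // ints handed out per pool

    public SlicePools() {
        for (int i = 0; i < 4; i++) pools[i] = new int[1 << 15]; // one 32K block
    }

    // Allocate the next slice in the given pool; returns its start offset.
    public int allocateSlice(int pool) {
        if (used[pool] + SLICE_SIZES[pool] > pools[pool].length) {
            // Grow the pool by another 32K block (real code links blocks
            // rather than copying the array).
            pools[pool] = Arrays.copyOf(pools[pool], pools[pool].length + (1 << 15));
        }
        int start = used[pool];
        used[pool] += SLICE_SIZES[pool];
        return start;
    }
}
```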
  51. Adding and appending to a list — the four pools (slice sizes 2^11, 2^7, 2^4, 2^1) hold available and allocated slices; the diagram follows one current list.
  52. Adding and appending to a list — Store the first two postings in a slice of the smallest (2^1) pool.
  53. Adding and appending to a list — When the first slice is full, allocate another slice in the second pool.
  54. Adding and appending to a list — Allocate a slice on each level as the list grows.
  55. Adding and appending to a list — On the uppermost level one list can own multiple slices.
  56. Posting list format (recap) — a posting is one int (32 bits): docID (24 bits, max. 16.7M) + textPosition (8 bits, max. 255). Tweet text can only have 140 chars. Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest a 5x improvement compared to vanilla Lucene with FSDirectory).
  57. Addressing items — Use 32-bit (int) pointers to address any item in any list unambiguously: poolIndex (2 bits, values 0-3), sliceIndex (19-29 bits, depends on pool), offset in slice (1-11 bits, depends on pool). Nice symmetry: postings and address pointers both fit into a 32-bit int.
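Decoding such a pointer is again bit arithmetic. A sketch, with the caveat that the exact pool-to-slice-size mapping and bit positions are my assumptions (only the 2/variable/variable split is from the slide):

```java
// Sketch: decoding a 32-bit address pointer. The top 2 bits select the
// pool; the low bits are the offset within the slice (1, 4, 7 or 11 bits,
// i.e. log2 of the pool's slice size); the remaining bits are the slice index.
public final class Address {
    static final int[] OFFSET_BITS = {1, 4, 7, 11}; // assumed pool ordering

    static int poolIndex(int addr) { return addr >>> 30; }

    static int offsetInSlice(int addr) {
        return addr & ((1 << OFFSET_BITS[poolIndex(addr)]) - 1);
    }

    static int sliceIndex(int addr) {
        // Drop the 2 pool bits at the top and the offset bits at the bottom.
        return (addr << 2) >>> (2 + OFFSET_BITS[poolIndex(addr)]);
    }

    static int encode(int pool, int slice, int offset) {
        return (pool << 30) | (slice << OFFSET_BITS[pool]) | offset;
    }
}
```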
  58. Linking the slices — the slices belonging to one list are linked together across the pools.
  59. Linking the slices — Dictionary: the parallel arrays hold a pointer to the last posting indexed for each term; from there the list's slices can be traversed backwards.
  60. Concurrency - Definitions • Pessimistic locking: a thread holds an exclusive lock on a resource while an action is performed [mutual exclusion]; usually used when conflicts are expected to be likely. • Optimistic locking: operations are attempted atomically without holding a lock; conflicts can be detected, and retry logic is often used in case of conflicts; usually used when conflicts are expected to be the exception.
  61. Concurrency - Definitions • Non-blocking algorithm: ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion. • Lock-free algorithm: a non-blocking algorithm is lock-free if there is guaranteed system-wide progress. • Wait-free algorithm: a non-blocking algorithm is wait-free if there is guaranteed per-thread progress. (Source: Wikipedia)
  62. Concurrency • Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies the data). • But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds! • In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication. • Safe publication can be achieved in different, subtle ways; read the great book "Java Concurrency in Practice" by Brian Goetz for more information!
  63. Java Memory Model • Program order rule: each action in a thread happens-before every action in that thread that comes later in the program order. • Volatile variable rule: a write to a volatile field happens-before every subsequent read of that same field. • Transitivity: if A happens-before B, and B happens-before C, then A happens-before C. (Source: Brian Goetz, Java Concurrency in Practice)
  64. Concurrency — diagram: Thread 1 and Thread 2 each have a CPU cache above shared RAM, which holds int x = 0.
  65. Concurrency — Thread 1 writes x = 5 into its cache; RAM still holds 0.
  66. Concurrency — Thread 2 spins on while(x != 5). This condition will likely never become false: Thread 1's write may never reach RAM.
  67. Concurrency — start over: empty caches; RAM holds int x = 0.
  68. Concurrency — Thread 1 writes x = 5 (cached) and then b = 1; the write of b goes to RAM, because b is volatile.
  69. Concurrency — Thread 2 first reads the volatile b (int dummy = b;), then spins on while(x != 5).
  70. Concurrency — Program order rule: each action in a thread happens-before every action in that thread that comes later in the program order (so x = 5 happens-before b = 1).
  71. Concurrency — Volatile variable rule: a write to a volatile field happens-before every subsequent read of that same field (so b = 1 happens-before int dummy = b).
  72. Concurrency — Transitivity: if A happens-before B, and B happens-before C, then A happens-before C (so x = 5 happens-before Thread 2's read of x).
  73. Concurrency — Now the condition will be false, i.e. x == 5. Note: x itself doesn't have to be volatile. There can be many variables like x, but we need only a single volatile field.
  74. Concurrency — The volatile write/read pair is the memory barrier that both threads cross. Note: x itself doesn't have to be volatile; there can be many variables like x, but we need only a single volatile field.
  75. Demo (a self-contained version of this visibility experiment is sketched below)
  76. Concurrency — recap of slide 74: the single volatile field b acts as the memory barrier.
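The demo presumably showed this effect live; here is a self-contained version one can run (my reconstruction of the example, not the original demo code):

```java
// Sketch of the visibility example: Thread 2 only reliably sees x == 5
// because it reads the volatile field b after Thread 1 wrote it,
// piggybacking on the happens-before edge of the volatile write/read.
public class VisibilityDemo {
    static int x;             // deliberately NOT volatile
    static volatile int b;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (b != 1) { /* spin until the volatile write is visible */ }
            // The volatile read above happens-after the volatile write,
            // which happens-after x = 5, so x is guaranteed to be 5 here.
            System.out.println("x = " + x);
        });
        reader.start();
        x = 5;   // plain write, published by the volatile write below
        b = 1;   // volatile write: crosses the memory barrier
        reader.join();
    }
}
```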
  77. Concurrency — Timeline: IndexWriter writes 100 docs, then sets maxDoc = 100, then writes more docs; IndexReader reads maxDoc in IR.open() and searches up to maxDoc. maxDoc is volatile.
  78. Concurrency — The write of maxDoc happens-before the read of maxDoc in IR.open(). Only maxDoc is volatile; all other fields that IW writes to and IR reads from don't need to be!
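Applied to the index, the same piggybacking pattern looks roughly like this sketch (illustrative field names, not Lucene's or Earlybird's actual classes):

```java
// Sketch: the writer publishes new postings to searchers through a single
// volatile field, mirroring the IndexWriter/IndexReader pattern above.
public class Segment {
    private final int[] postings = new int[1 << 20];
    private int numPostings;      // plain field: written by the writer only
    private volatile int maxDoc;  // the one volatile field

    // Called by the single writer thread.
    public void addDocument(int docId, int posting) {
        postings[numPostings++] = posting; // plain writes...
        maxDoc = docId;                    // ...published by this volatile write
    }

    // Called by many searcher threads.
    public int maxSearchableDoc() {
        // This volatile read guarantees that every posting written before
        // the corresponding volatile write of maxDoc is now visible.
        return maxDoc;
    }
}
```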
  79. Wait-free • Not a single exclusive lock • The writer thread can always make progress • Optimistic locking (retry logic) in a few places for searcher threads • The retry logic is very simple and guaranteed to always make progress
  80. Realtime Search @twitter. Agenda: - Introduction - Search Architecture - Inverted Index 101 - Memory Model & Concurrency ‣ Top Tweets
  81. Top Tweets
  82. Signals • Query-dependent: e.g. Lucene text score, language • Query-independent: static signals (e.g. text quality), dynamic signals (e.g. retweets), timeliness
  83. Signals — Task: find the best tweet among billions of tweets efficiently, across many Earlybird segments (8M documents each).
  84. Signals • For top tweets we can't early-terminate as efficiently • Scoring and ranking billions of tweets is impractical
  85. Query cache — Idea: mark all tweets with high query-independent (static) scores and only visit those for top tweets queries. The marks live in a per-segment query results cache.
  86. Query cache • A background thread periodically wakes up, executes queries, and stores the results in the per-segment cache • Rewriting queries: the user query q = 'lucene' is rewritten to q' = 'lucene AND cached_filter:toptweets'; the cached clause is executed as a Lucene ConstantScoreQuery that wraps a BitSet or SortedVIntList (see the sketch below) • Efficient skipping over tweets with low query-independent scores
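The rewrite step might look like the following sketch against the Lucene 3.x-era API; the Filter holding the cached BitSet/SortedVIntList is hypothetical and would be looked up per segment:

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;

// Sketch: rewrite q = 'lucene' into q' = 'lucene AND cached_filter:toptweets'.
public final class TopTweetsRewriter {
    public static Query rewrite(Query userQuery, Filter topTweetsFilter) {
        BooleanQuery rewritten = new BooleanQuery();
        rewritten.add(userQuery, BooleanClause.Occur.MUST);
        // The cached filter plays the role of 'cached_filter:toptweets':
        // it matches only docs with a high query-independent score and
        // contributes a constant score.
        rewritten.add(new ConstantScoreQuery(topTweetsFilter), BooleanClause.Occur.MUST);
        return rewritten;
    }
}
```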
  87. Query cache • A background thread periodically wakes up, executes queries, and stores the results in the per-segment cache • Configurable per cached query: result set type (BitSet, SortedVIntList), execution schedule, filter mode (cache-only or hybrid)
  88. Query cache • Result set type: a BitSet for results with many hits, a SortedVIntList for very sparse results
  89. Query cache • Execution schedule: per segment, the sleep time between refreshes of the cached results
  90. Query cache • Filter mode: cache-only or hybrid
  91. Query cache — filter mode illustrated on a segment; read direction is new-to-old.
  92. Query cache — A partially filled segment (IndexWriter is currently appending).
  93. Query cache — The query cache can only be computed up to the current maxDocID.
  94. Query cache — Some time later: more docs are in the segment, but the query cache has not been updated yet.
  95. Query cache — Cache-only mode: the search range ignores documents added to the segment since the cache was updated.
  96. Query cache — Hybrid mode: fall back to the original query and execute it on the latest documents...
  97. Query cache — ...and use the query cache up to the cache's maxDocID (sketched below).
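The hybrid mode can be pictured as two passes in read order; every name in this sketch is illustrative, not Earlybird code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch of hybrid filter mode. Read direction is new-to-old: documents
// appended after the cache was last refreshed are checked against the
// original query; older documents come straight from the cached result.
public final class HybridSearch {
    public static List<Integer> search(int segmentMaxDoc, int cacheMaxDoc,
                                       IntPredicate originalQuery,
                                       List<Integer> cachedHitsDescending) {
        List<Integer> hits = new ArrayList<>();
        // 1. Newest docs, not yet covered by the cache: evaluate the query.
        for (int docId = segmentMaxDoc; docId > cacheMaxDoc; docId--) {
            if (originalQuery.test(docId)) hits.add(docId);
        }
        // 2. Docs up to the cache's maxDocID: reuse the cached hits.
        hits.addAll(cachedHitsDescending);
        return hits;
    }
}
```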
  98. Query result cache — cached queries are configured in a query cache config yml file. (Many Earlybird segments, 8M documents each.)
  99. Questions?
