Realtime search

  1. 1. Realtime Search 罗磊
  2. 2. References
        - A Billion Queries Per Day
        - Bigtable: A Distributed Storage System for Structured Data
        - Large-scale Incremental Processing Using Distributed Transactions and Notifications
  3. 3. Realtime Search @ Twitter
  4. 6. Posting list format and early query termination
        - Evaluate as few documents as possible before terminating the query
        - Rank documents in reverse time order (newest documents first)
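
The two points on this slide can be sketched together. This is a minimal illustration, not Twitter's implementation: it assumes each posting list is an append-only list of document IDs in arrival order, so scanning from the tail visits the newest documents first. The names `search_newest_first` and `postings` are hypothetical.

```python
from typing import Dict, List


def search_newest_first(postings: Dict[str, List[int]],
                        terms: List[str], k: int) -> List[int]:
    """Return up to k doc IDs that contain every term, newest first.

    Assumption: posting lists hold doc IDs in ascending (arrival) order,
    so a larger ID means a newer document.
    """
    lists = [postings.get(t, []) for t in terms]
    if k <= 0 or not lists or any(not pl for pl in lists):
        return []
    # Drive the scan from the shortest list; the others become lookup sets.
    lists.sort(key=len)
    driver, others = lists[0], [set(pl) for pl in lists[1:]]
    hits: List[int] = []
    for doc_id in reversed(driver):          # newest documents first
        if all(doc_id in s for s in others):
            hits.append(doc_id)
            if len(hits) == k:               # early termination
                break
    return hits


postings = {"realtime": [1, 2, 3, 5, 8], "search": [2, 3, 8, 9]}
print(search_newest_first(postings, ["realtime", "search"], k=2))  # [8, 3]
```

Because results are ranked by recency, the scan can stop as soon as `k` matches are found; older entries in the posting list are never touched.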
  5. 13. Index memory model
  6. 31. Lock-free algorithms and data structures
  7. 46. Realtime Search @ Twitter
        - Q & A
  8. 47. Realtime Search @ Google
  9. 48. Argument about MapReduce
        - MapReduce: A major step backwards
        - MapReduce and Parallel DBMSs: Friends or Foes?
        - MapReduce: A Flexible Data Processing Tool
  10. 49. Google's realtime search
        - Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines.
        - MapReduce and other batch-processing systems cannot process small updates individually, as they rely on creating large batches for efficiency.
  11. 50. Avoiding duplicates in the index
        - MapReduce cannot process a small update on its own
        - After any change, MapReduce must be run again over the entire repository
  12. 51. Where to store the index at Google
        - DBMS
            - Cannot handle the sheer volume of data: Google's indexing system stores tens of petabytes across thousands of machines.
        - Bigtable
            - Provides no tools to help programmers maintain data invariants in the face of concurrent updates
  13. 52. Percolator
        - Incremental processing model
            - Maintains a very large repository of documents and updates it efficiently
        - Processes many updates concurrently
  14. 53. Percolator
        - Percolator provides two main abstractions for performing incremental processing at large scale:
            - ACID transactions over a random-access repository
            - Observers, a way to organize an incremental computation
  15. 54. Percolator
  16. 55. Transaction is a must
  17. 56. Transaction is a must
  18. 57. Transaction is a must
  19. 58. Transactions in Percolator
  20. 59. Transactions in Percolator
  21. 60. Transactions in Percolator
  22. 61. Transactions in Percolator
  23. 62. Transactions in Percolator
  24. 63. Transactions in Percolator
        - Transaction processing is complicated by the possibility of client failure
            - Tablet server failure does not affect the system, since Bigtable guarantees that written locks persist across tablet server failures
        - A transaction will not clean up a lock unless it suspects that the lock belongs to a dead or stuck worker
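
The cleanup rule on this slide can be sketched as a small guard. This is an illustration under stated assumptions, not Percolator's code: the `Lock` record, the `live_workers` set, and the `max_age` timeout are all hypothetical stand-ins for whatever liveness signal the real system uses, and the sketch leaves out the decision a real cleanup must make about whether to roll the failed transaction forward or back.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Lock:
    owner: str        # hypothetical ID of the worker that wrote the lock
    wall_time: float  # when the lock was written, in seconds


def try_cleanup(locks: Dict[str, Lock], cell: str, live_workers: Set[str],
                now: float, max_age: float = 60.0) -> bool:
    """Remove the lock on `cell` only if its owner looks dead or stuck.

    Returns True if the cell is unlocked after the call.
    """
    lock = locks.get(cell)
    if lock is None:
        return True
    dead = lock.owner not in live_workers       # owner no longer reports in
    stuck = now - lock.wall_time > max_age      # lock held suspiciously long
    if dead or stuck:
        del locks[cell]
        return True
    return False  # owner seems healthy: leave the lock, the caller backs off


locks = {"row1:content": Lock(owner="worker-7", wall_time=0.0)}
print(try_cleanup(locks, "row1:content", {"worker-7"}, now=10.0))  # False
print(try_cleanup(locks, "row1:content", set(), now=10.0))         # True
```

Being conservative here matters: removing the lock of a healthy worker would force that worker's transaction to fail, so a transaction that merely encounters a lock waits rather than cleans up.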
  25. 64. Notifications
        - Transactions let the user mutate the table while maintaining invariants, but users also need a way to trigger and run the transactions
        - To provide notifications, Percolator maintains a special "observation" column.
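
A toy, single-process sketch of the notification idea follows. It assumes the "observation" column can be modeled as a set of dirty (row, column) pairs and that observers are plain callbacks registered per column; the `Table` class and its methods are hypothetical and say nothing about how Percolator stores or scans this column on Bigtable.

```python
from typing import Callable, Dict, List, Set, Tuple

Observer = Callable[["Table", str], None]


class Table:
    def __init__(self) -> None:
        self.cells: Dict[Tuple[str, str], str] = {}    # (row, column) -> value
        self.dirty: Set[Tuple[str, str]] = set()       # stand-in for the "observation" column
        self.observers: Dict[str, List[Observer]] = {}

    def observe(self, column: str, fn: Observer) -> None:
        """Register fn to run whenever `column` changes in any row."""
        self.observers.setdefault(column, []).append(fn)

    def write(self, row: str, column: str, value: str) -> None:
        self.cells[(row, column)] = value
        if column in self.observers:                   # only observed columns get marked
            self.dirty.add((row, column))

    def run_pending(self) -> int:
        """Run observers for every dirty cell; return how many observer calls ran."""
        ran = 0
        while self.dirty:
            row, column = self.dirty.pop()
            for fn in self.observers.get(column, []):
                fn(self, row)                          # may write and mark more cells dirty
                ran += 1
        return ran


table = Table()
index: List[Tuple[str, str]] = []
table.observe("raw", lambda t, row: t.write(row, "parsed", t.cells[(row, "raw")].upper()))
table.observe("parsed", lambda t, row: index.append((row, t.cells[(row, "parsed")])))
table.write("doc1", "raw", "hello")
table.run_pending()
print(index)  # [('doc1', 'HELLO')]
```

One write to `raw` triggers the first observer, whose own write to `parsed` triggers the second; this chaining is how a single change can ripple through an incremental computation.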
  26. 65. Optimization
        - Reduce the number of RPCs
        - Batch reads that go to the same tablet server
        - Patch the Linux kernel to support high thread counts
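
The batching point can be illustrated with a short sketch. It assumes a caller-supplied `server_for` function that maps a key to its server and an `rpc_read` function that reads many keys from one server in a single call; both are hypothetical placeholders, not a real Bigtable client API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def batched_read(keys: Iterable[str],
                 server_for: Callable[[str], str],
                 rpc_read: Callable[[str, List[str]], Dict[str, str]]) -> Dict[str, str]:
    """Read all keys using one RPC per server instead of one RPC per key."""
    by_server: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        by_server[server_for(key)].append(key)     # group keys by owning server
    result: Dict[str, str] = {}
    for server, batch in by_server.items():
        result.update(rpc_read(server, batch))     # a single round trip per server
    return result


calls: List[str] = []


def fake_rpc(server: str, batch: List[str]) -> Dict[str, str]:
    calls.append(server)                           # record each simulated round trip
    return {k: f"value-of-{k}" for k in batch}


values = batched_read(["a1", "a2", "b1"], lambda k: k[0], fake_rpc)
print(len(calls))  # 2
```

Three keys cost two round trips here because `a1` and `a2` live on the same simulated server; the saving grows with the number of keys per server.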
  27. 66. Evaluation
  28. 67. Evaluation
  29. 68. Realtime Search @ Google
        - Q & A
  30. 69. Realtime Search
        - Realtime Search @ Twitter
        - Realtime Search @ Google
        - Thank you
