Riak Search - Erlang Factory London 2010

10,041 views

Published on

Riak Search is a distributed data indexing and search platform built on top of Riak. The talk will introduce Riak Search, covering overall goals, architecture, and core functionality, with specific focus on how Erlang is used to manage and execute an ever-changing population of ad hoc query processes.

Published in: Technology, News & Politics
0 Comments
16 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
10,041
On SlideShare
0
From Embeds
0
Number of Embeds
2,143
Actions
Shares
0
Downloads
101
Comments
0
Likes
16
Embeds 0
No embeds

No notes for slide

Riak Search - Erlang Factory London 2010

  1. 1. Riak Search A Full-Text Search and Indexing Engine based on Riak Erlang Factory· London· June 2010 Basho Technologies Rusty Klophaus - @rklophaus
  2. 2. Why did we build it? What are the major goals? How does it work? 2
  3. 3. Part One Why did we build Riak Search? 3
  4. 4. Riak is a scalable, highly-available, networked, open-source key/value store. 4
  5. 5. Writing to a Key/Value Store Key/Value CLIENT RIAK 5
  6. 6. Writing to a Key/Value Store Object CLIENT RIAK 6
  7. 7. Querying a Key/Value Store Key Object CLIENT RIAK 7
  8. 8. Querying Riak via LinkWalking Key + Instructions Walk to Related Keys Object(s) CLIENT RIAK 8
  9. 9. Querying Riak via Map/Reduce Key(s) + JS Functions Map Map Reduce Computed Value(s) CLIENT RIAK 9
  10. 10. Key/Value Stores like Key-Based Queries 10
  11. 11. Query by Secondary Index where Category == "Shoes" WTF!? I'm a KV store! CLIENT RIAK 11
  12. 12. Full-Text Query "Converse AND Shoes" This is getting old. CLIENT RIAK 12
  13. 13. These kinds of queries need an Index. *Market Opportunity!* 13
  14. 14. Part Two What are the major goals of Riak Search? 14
  15. 15. An application built on Riak. Your Riak Application 15
  16. 16. Hrm... I need an index. Your Riak Application Index Object 16
  17. 17. Hrm... I need an index with more features. Your ??? Riak Application 17
  18. 18. Lucene should do the trick... Your Lucene Riak Application 18
  19. 19. ...shard to add more storage capacity... Your Application Lucene Lucene Lucene Riak 19
  20. 20. ...replicate to add more throughput. Lucene Lucene Lucene Your Application Lucene Lucene Lucene Riak Lucene Lucene Lucene 20
  21. 21. ...replicate to add more throughput. Lucene Lucene Lucene Your Application Lucene Lucene Lucene Riak Lucene Lucene Lucene Operations nightmare! 21
  22. 22. What do we really want? Your Riak-ified Riak Application Lucene 22
  23. 23. What do we really want? Your Riak Riak Application Search 23
  24. 24. Functionality? Be like Lucene (and more). • Lucene Syntax • Leverages Java Lucene Analyzers • Solr Endpoints • Integration via Riak Post-Commit Hook (Index) • Integration via Riak Map/Reduce (Query) • Near-Realtime • Schema-less 24
  25. 25. Operations? Be like Riak. • No special nodes • Add nodes, get more compute and storage • Automatically load balance • Replicas for durability and performance • Index and query in parallel • Swappable storage backends 25
  26. 26. Part Three How do we do it? 26
  27. 27. A Gentle Introduction to Document Indexing 27
  28. 28. The Inverted Index Document Inverted Index day, 1 dog, 1 #1 Every dog has his day. every, 1 has, 1 his, 1 28
  29. 29. The Inverted Index Documents Combined Inverted Index and, 4 #1 Every dog has his day. bag, 3 bark, 2 bite, 2 The dog's bark #2 cat, 3 is worse than his bite. cat, 4 day, 1 dog, 1 #3 Let the cat out of the bag. dog, 2 dog, 4 every, 1 #4 It's raining cats and dogs. has, 1 ... 29
  30. 30. At Query Time... "dog AND cat" AND dog cat 30
  31. 31. At Query Time... AND dog cat dog, 1 cat, 3 dog, 2 cat, 4 dog, 4 31
  32. 32. At Query Time... Result: 4 AND (Merge Intersection) 1 3 2 4 4 32
  33. 33. At Query Time... Result: 1, 2, 3, 4 OR (Merge Union) 1 3 2 4 4 33
  34. 34. Complex Behavior from Simple Structures 34
  35. 35. Storage Approaches... 35
  36. 36. Riak Search uses Consistent Hashing to store data on Partitions 36
  37. 37. Introduction to Consistent Hashing and Partitions Partitions = 10 Number of Nodes = 5 Partitions per Node = 2 Replicas (NVal) = 2 37
  38. 38. Introduction to Consistent Hashing and Partitions Object 38
  39. 39. Document Partitioning vs. Term Partitioning 39
  40. 40. ...and the Resulting Tradeoffs 40
  41. 41. Document Partitioning @ Index Time #1 Every dog has his day. 41
  42. 42. Document Partitioning @ Query Time "dog OR cat" 42
  43. 43. Term Partitioning @ Index Time day, 1 dog, 1 #1 Every dog has his day. every, 1 has, 1 his, 1 43
  44. 44. Term Partitioning @ Index Time dog, 1 day, 1 has, 1 his, 1 every, 1 44
  45. 45. Term Partitioning @ Query Time "dog OR cat" 45
  46. 46. Tradeoffs... Document Partitioning Term Partitioning + Lower Latency Queries - Higher Latency Queries - Lower Throughput + Higher Throughput - Lots of Disk Seeks - Hotspots in Ring (the "Obama" problem) 46
  47. 47. Riak Search: Term Partitioning Term-partitioning is the most viable approach for our beta clients’ needs: high throughput on Really Big Datasets. Optimizations: • Term splitting to reduce hot spots • Bloom filters & caching to save query-time bandwidth • Batching to save query-time & index-time bandwidth Support for either approach eventually. 47
  48. 48. Diving Deeper: The Lifecycle of a Query 48
  49. 49. Parse the Query 49
  50. 50. The Query meeting AND (face OR phone) 50
  51. 51. The Query as an Erlang Term (Parse w/ Leex and Yecc) [{land, [ {term,"meeting",[]}, {lor,[ {term,"face",[]}, {term,"phone",[]} ]} ]}] 51
  52. 52. The Query as a Graph #land #term #lor "meeting" #term #term "phone" "face" 52
  53. 53. Plan the Query 53
  54. 54. System Catalog Catalog Catalog Catalog Catalog Catalog Catalog Catalog Catalog 54
  55. 55. System Catalog Term Weight Term TermID & File Offset 55
  56. 56. Consult the System Catalog for Term/Node Weights #land #term #lor "meeting" #term #term 23 @ node B "phone" "face" 17 @ node A 13 @ node C 56
  57. 57. Use Term Weights to Plan the Query #land #term #lor "meeting" #term #term 23 @ node B "phone" "face" 17 @ node A 13 @ node C 57
  58. 58. Use Term Weights to Plan the Query #land #term #lor "meeting" #term #term 23 @ node B "phone" "face" 17 @ node A 13 @ node C 58
  59. 59. Use Term Weights to Plan the Query #node@B #land #term #node@A "meeting" #lor #term #term 23 @ node B "phone" "face" 13 @ node C 17 @ node A 59
  60. 60. The Node-Assigned Query as an Erlang Term [{node, {land, [ {node, {lor, [ {term,{"email","body","face"}, [ {node_weight,'node_c@127.0.0.1', 13} ]}, {term,{"email","body", "phone"}, [ {node_weight,'node_a@127.0.0.1', 17} ]} ]}, 'node_a@127.0.0.1' }, {term, {"email","body","meeting"}, [ {node_weight,'node_b@127.0.0.1', 23} ]} ]}, 'node_b@127.0.0.1' }] 60
  61. 61. Execute the Query 61
  62. 62. Spawn the Query Processes #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 62
  63. 63. Spawn the Query Processes #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 63
  64. 64. Spawn the Query Processes #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 64
  65. 65. Spawn the Query Processes #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 65
  66. 66. Spawn the Query Processes #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 66
  67. 67. Spawn the Query Processes & Stream the Results #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 67
  68. 68. Spawn the Query Processes & Stream the Results #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 68
  69. 69. Spawn the Query Processes & Stream the Results #node@B #land #term #node@A "meeting" #lor #term #term "phone" "face" 69
  70. 70. Terminate When Finished #node@B #land #term #node@A 'disconnect' #lor #term #term 'disconnect' 'disconnect' 70
  71. 71. Message Format 71
  72. 72. The Message Format Message :: {results, [Result]} | {results, disconnect} Result :: {DocID, Properties} DocID :: term() Properties :: proplist() 72
  73. 73. The Message Format {results, [ {375, []}, {961, [{color, "red"}]}, {155, [{pos, [1,2,5]}]} ]} 73
  74. 74. Yay for Erlang! • Clean lines between load balancing and logic, single- and multi-node look the same • Easy to create new operators, rapid development of experimental features • Linked processes make cleanup a breeze • Significant code reduction over early Java prototypes 74
  75. 75. Part Four Review 75
  76. 76. Riak Search turns this... "Converse AND Shoes" WTF!? I'm a KV store! CLIENT RIAK 76
  77. 77. ...into this... "Converse AND Shoes" Gladly! CLIENT RIAK 77
  78. 78. ...into this... "Converse AND Shoes" Keys or Objects CLIENT RIAK 78
  79. 79. ...while keeping operations easy. Your Riak Riak Application Search 79
  80. 80. Thanks! Questions? Search Team: John Muellerleile - @jrecursive Rusty Klophaus - @rklophaus Kevin Smith - @kevsmith Currently working with a small set of Beta users. Open-source release planned for Q3. www.basho.com

×