FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

10,133 views

Published on

Published in: Technology
1 Comment
15 Likes
Statistics
Notes
No Downloads
Views
Total views
10,133
On SlideShare
0
From Embeds
0
Number of Embeds
33
Actions
Shares
0
Downloads
142
Comments
1
Likes
15
Embeds 0
No embeds

No notes for slide

FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

  1. 1. A Real-Time Search Engine with Lucene and S4Yahoo! S4 applied to Information Retrieval 2/5/2011 Michaël Figuière
  2. 2. Speaker @mfiguiere blog.xebia.fr Michaël Figuière Distributed Architectures NoSQLSearch Engines
  3. 3. Our case study A Search Engine to keep track of activities within an enterprise
  4. 4. The Problem
  5. 5. A Search Engine Search
  6. 6. A Search Engine MyCustomer Search
  7. 7. A Search Engine MyCustomer Search Document Non Disclosure Agreement 12 days ago ... MyCustomer agrees not to disclose any part of ... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  8. 8. Indexing Pipeline Tika PDF Text Analyzer Extractor Search Index Analyzer Phone Call Lucene
  9. 9. A more complex Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  10. 10. Indexing Pipeline Tika Mahout PDF Text Classifier Analyzer Extractor Search Index Classifier Analyzer Phone Call Lucene
  11. 11. More complex ...• Entity Recognition Recognizes an entity written in any way• Language Recognition To index each language separately• Fetching linked URLs Enhances document context by also indexing linked URLs• ...
  12. 12. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  13. 13. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  14. 14. Indexing Pipeline Since Lucene 2.9 PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some AnalyzerPhone Pre-Processing Call
  15. 15. But... PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some AnalyzerPhone Pre-Processing Call What if it takes one second/document on a single box ??
  16. 16. Let’s distribute it Server 1 Server 3 Pre- Search Pre- Search Processing Index Processing Index Server 2 Server N Pre- Search Processing Index Processing logic and index structure distributed together
  17. 17. That’s a problem...• Processing and index storage may have different scaling needs Depending on the search traffic, the processing overhead, ...• Scaling up and down an index storage is long and complex Whereas stateless processing is simple to scale up/down• Expensive pre-processing may make searches slower And indexing in real-time shouldn’t make searches slower !
  18. 18. Let’s move it to Hadoop PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some AnalyzerPhone Pre-Processing Call Hadoop MapReduce
  19. 19. But...• Hadoop can only deal with chunk of data Data must be available somewhere on HDFS• Unbounded stream of data can’t fit into Hadoop MapReduce Hadoop is thought and optimized for batch processing• Manually bounding the stream won’t be efficient It’ll resulting in lot of regular and inefficient batches
  20. 20. S4
  21. 21. S4• A distributed, fault-tolerant, stream processing system• Elastic Based on Zookeeper• Project started in november 2010, still experimental But things are moving fast !
  22. 22. Where does S4 come from ?• Open Source project created by Yahoo!• Initially built for relevant ad selection and clever positioning on webpages But thought to be generic enough• Expensive pre-processing may make searches slower And indexing in real-time shouldn’t make searches slower !
  23. 23. Processing Element Your business logic goes here Processing Element Events Input Events Output
  24. 24. Processing Node Processing Node Processing Processing Processing Element 1 Element 2 Element N
  25. 25. S4 Cluster Cluster Management Processing Node 1 Events Processing Node 2 Zookeeper Stream Processing Node N
  26. 26. Programming model PhoneCallPE Accept events with : Type=PhoneCall Event KeyTuple: Id=15497 Event Type: PhoneCall Type: EnrichedPhoneCall KeyTuple: «Id=15497» KeyTuple: «Id=15497» Value: <serialized object> Value: <serialized object> A new Processing Element instance is created for each value of «Id»
  27. 27. An indexing pipeline with S4 ReRoutingPE Handles incoming events and load-balance them according to partitioning TextExtractionPE TextExtractionPE ReRoutingPE ClassificationPE ClassificationPE MergingPE
  28. 28. An indexing pipeline with S4 ReRoutingPE TextExtractionPE TextExtractionPE Handles result events ReRoutingPE and load-balance between Processing Nodes ClassificationPE ClassificationPE MergingPE
  29. 29. An indexing pipeline with S4 ReRoutingPE TextExtractionPE TextExtractionPE ReRoutingPE ClassificationPE ClassificationPE Handles final result events and push MergingPE them to the Indexer
  30. 30. Some drawbacks• The system is lossy Events may be lost when nodes are overloaded or during failure• A workaround is to increase the incoming queue of nodes But still, events may be lost during failure• Still experimental But very promising
  31. 31. More: Real-Time Inverted Search MyCustomer Search Sales Juridic Accounting 20 new results... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  32. 32. Summary• S4 is a nice processing system for real-time search Events may be lost when nodes are overloaded or during failure• Not only for indexing-time, also for query-time ! As S4 ensures low latency, query-time processing is possible• A promising roadmap.... Better failure handling, client API in major languages, initial processing with Hadoop, ...
  33. 33. Questions / Answers ? blog.xebia.fr @mfiguiere

×