
FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

Transcript of "FOSDEM (feb 2011) - A real-time search engine with Lucene and S4"

  1. A Real-Time Search Engine with Lucene and S4: Yahoo! S4 applied to Information Retrieval. 2/5/2011, Michaël Figuière
  2. Speaker: Michaël Figuière (@mfiguiere, blog.xebia.fr). Distributed Architectures, NoSQL, Search Engines
  3. Our case study: a search engine to keep track of activities within an enterprise
  4. The Problem
  5. A Search Engine (mockup: an empty search box with a "Search" button)
  6. A Search Engine (mockup: the query "MyCustomer" typed into the search box)
  7. A Search Engine (mockup: results for the query "MyCustomer")
      • Document: Non Disclosure Agreement, 12 days ago. "... MyCustomer agrees not to disclose any part of ..."
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 2 days ago. Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
  8. Indexing Pipeline (diagram): PDF -> Text Extractor (Tika) -> Analyzer -> Search Index (Lucene); Phone Call -> Analyzer -> Search Index (Lucene)
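
The pipeline on slide 8 can be sketched in a few lines. This is a minimal illustration in plain Python, not the Tika or Lucene API: `extract_text`, `analyze` and `InvertedIndex` are hypothetical stand-ins for the text extractor, the analyzer and the search index.

```python
import re
from collections import defaultdict


def extract_text(raw):
    # Stand-in for Tika: assume the document bytes are already plain UTF-8 text.
    return raw.decode("utf-8", errors="ignore")


def analyze(text):
    # Stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class InvertedIndex:
    """Toy search index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens):
        for token in tokens:
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
# PDF branch: extract text first, then analyze.
index.add("nda.pdf", analyze(extract_text(b"MyCustomer agrees not to disclose any part of ...")))
# Phone call branch: already text, so it goes straight to the analyzer.
index.add("call-1", analyze("Customer: MyCustomer. Invoice not received for order #2354E"))
```

Both branches end in the same index, so a single query such as `index.search("MyCustomer")` can return documents and phone calls together.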
  9. A more complex Search Engine (mockup: results for "MyCustomer" with the categories Sales, Juridic, Accounting)
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 2 days ago. Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
  10. Indexing Pipeline (diagram): PDF -> Text Extractor (Tika) -> Classifier (Mahout) -> Analyzer -> Search Index (Lucene); Phone Call -> Classifier (Mahout) -> Analyzer -> Search Index (Lucene)
  11. More complex ...
      • Entity Recognition: recognizes an entity however it is written
      • Language Recognition: to index each language separately
      • Fetching linked URLs: enhances document context by also indexing linked URLs
      • ...
  12. A Real-Time Search Engine (mockup: results for "MyCustomer" with the categories Sales, Juridic, Accounting)
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 3 seconds ago. Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
  13. A Real-Time Search Engine (same mockup as slide 12)
  14. Indexing Pipeline (diagram): PDF -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index (since Lucene 2.9); Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
  15. But... (same pipeline diagram as slide 14) What if it takes one second per document on a single box?
  16. Let’s distribute it (diagram): Server 1, Server 2, Server 3, ..., Server N each run Pre-Processing next to a Search Index. Processing logic and index structure are distributed together
  17. That’s a problem...
      • Processing and index storage may have different scaling needs, depending on the search traffic, the processing overhead, ...
      • Scaling index storage up and down is slow and complex, whereas stateless processing is simple to scale up and down
      • Expensive pre-processing may make searches slower, and indexing in real time shouldn’t make searches slower!
  18. Let’s move it to Hadoop (diagram: the same indexing pipeline as slide 14, now labeled "Hadoop MapReduce")
  19. But...
      • Hadoop can only deal with chunks of data: the data must be available somewhere on HDFS
      • An unbounded stream of data can’t fit into Hadoop MapReduce: Hadoop is designed and optimized for batch processing
      • Manually bounding the stream won’t be efficient: it would result in a lot of periodic, inefficient batches
  20. S4
  21. S4
      • A distributed, fault-tolerant stream processing system
      • Elastic: based on ZooKeeper
      • Project started in November 2010, still experimental, but things are moving fast!
  22. Where does S4 come from?
      • Open source project created by Yahoo!
      • Initially built for relevant ad selection and clever positioning on web pages, but designed to be generic enough
      • Expensive pre-processing may make searches slower, and indexing in real time shouldn’t make searches slower!
  23. Processing Element (diagram): Input Events -> Processing Element (your business logic goes here) -> Output Events
  24. Processing Node (diagram): a Processing Node hosts Processing Element 1, Processing Element 2, ..., Processing Element N
  25. S4 Cluster (diagram): an events stream feeds Processing Node 1, Processing Node 2, ..., Processing Node N; cluster management relies on ZooKeeper
  26. Programming model (diagram): PhoneCallPE accepts events with Type=PhoneCall and KeyTuple Id=15497
      • Input event: Type: PhoneCall; KeyTuple: «Id=15497»; Value: <serialized object>
      • Output event: Type: EnrichedPhoneCall; KeyTuple: «Id=15497»; Value: <serialized object>
      • A new Processing Element instance is created for each value of «Id»
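
The keyed programming model of slide 26 can be illustrated with a small sketch. It is plain Python and does not use the S4 API: `Event`, `PhoneCallPE` and `Dispatcher` are hypothetical names, and the per-key counter is an invented example of state a Processing Element might keep.

```python
from dataclasses import dataclass


@dataclass
class Event:
    type: str      # e.g. "PhoneCall"
    key: tuple     # e.g. ("Id", 15497)
    value: object  # payload; a plain dict stands in for the serialized object


class PhoneCallPE:
    """One instance exists per distinct key value, so state is per key."""

    accepts = "PhoneCall"

    def __init__(self, key):
        self.key = key
        self.seen = 0  # hypothetical per-key state

    def process(self, event):
        self.seen += 1
        enriched = dict(event.value, events_seen_for_key=self.seen)
        return Event("EnrichedPhoneCall", event.key, enriched)


class Dispatcher:
    """Routes each event to the PE instance owning its key, creating it on demand."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        if event.type != self.pe_class.accepts:
            return None  # this PE type does not accept the event
        pe = self.instances.get(event.key)
        if pe is None:
            pe = self.instances[event.key] = self.pe_class(event.key)
        return pe.process(event)


dispatcher = Dispatcher(PhoneCallPE)
first = dispatcher.dispatch(Event("PhoneCall", ("Id", 15497), {"customer": "MyCustomer"}))
second = dispatcher.dispatch(Event("PhoneCall", ("Id", 15497), {"customer": "MyCustomer"}))
other = dispatcher.dispatch(Event("PhoneCall", ("Id", 15498), {"customer": "Other"}))
```

Two events sharing «Id=15497» reach the same instance, while the event with a different «Id» triggers the creation of a second instance.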
  27. An indexing pipeline with S4 (diagram): ReRoutingPE -> TextExtractionPE (two instances shown) -> ReRoutingPE -> ClassificationPE (two instances shown) -> MergingPE. The first ReRoutingPE handles incoming events and load-balances them according to partitioning
  28. An indexing pipeline with S4 (same diagram): the second ReRoutingPE handles result events and load-balances them between Processing Nodes
  29. An indexing pipeline with S4 (same diagram): the MergingPE handles final result events and pushes them to the Indexer
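
The pipeline of slides 27 to 29 can be simulated in a single process. The sketch below is an assumption-laden illustration rather than S4 code: partitions are plain lists, `partition` uses a CRC32 hash as one possible stable partitioning function, and the keyword rule in `classification_pe` is an invented stand-in for a real classifier.

```python
import zlib


def partition(key, n_partitions):
    # Stable hash: the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def reroute(events, n_partitions):
    # ReRoutingPE: spread events over partitions according to their key.
    partitions = [[] for _ in range(n_partitions)]
    for event in events:
        partitions[partition(event["id"], n_partitions)].append(event)
    return partitions


def text_extraction_pe(event):
    # Stand-in for real text extraction: decode the raw bytes.
    return dict(event, text=event["raw"].decode("utf-8", errors="ignore"))


def classification_pe(event):
    # Hypothetical keyword rule standing in for a trained classifier.
    category = "Accounting" if "invoice" in event["text"].lower() else "Sales"
    return dict(event, category=category)


def merging_pe(events, indexer):
    # Collect final results and push them to the indexer (here, a list).
    for event in events:
        indexer.append({"id": event["id"], "text": event["text"], "category": event["category"]})
    return indexer


def run_pipeline(events, n_partitions=2):
    extracted = [text_extraction_pe(e) for part in reroute(events, n_partitions) for e in part]
    classified = [classification_pe(e) for part in reroute(extracted, n_partitions) for e in part]
    return merging_pe(classified, [])


incoming = [
    {"id": "call-1", "raw": b"Invoice not received for order #2354E"},
    {"id": "doc-1", "raw": b"MyCustomer: 12 M EUR with 3 deals"},
]
indexed = run_pipeline(incoming)
```

Because each stage is stateless, every partition could run on a different Processing Node; only the final indexer needs to see all results. Output order depends on the partitioning, so callers should not rely on it.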
  30. Some drawbacks
      • The system is lossy: events may be lost when nodes are overloaded or during failure
      • A workaround is to increase the size of the nodes’ incoming queues, but events may still be lost during failure
      • Still experimental, but very promising
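
The effect of the incoming queue size described on slide 30 can be shown with a small simulation. This is a sketch under stated assumptions, not S4's actual queueing code: `LossyNode` is a hypothetical class, and dropping the newest event when the queue is full is an assumed policy.

```python
from collections import deque


class LossyNode:
    """Node with a bounded incoming queue; events arriving when it is full are dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def offer(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # assumed policy: drop the newly arrived event
            return False
        self.queue.append(event)
        return True

    def drain(self, max_events):
        count = min(max_events, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


def simulate(capacity, burst, processed_per_tick, ticks):
    # Each tick: a burst of events arrives, then the node processes a fixed number.
    node = LossyNode(capacity)
    processed = 0
    for _ in range(ticks):
        for i in range(burst):
            node.offer(i)
        processed += len(node.drain(processed_per_tick))
    return processed, node.dropped


small_queue = simulate(capacity=5, burst=10, processed_per_tick=10, ticks=3)
large_queue = simulate(capacity=10, burst=10, processed_per_tick=10, ticks=3)
```

With the same load, the larger queue absorbs each burst and nothing is dropped. The queue lives in memory, though, so whatever it holds is still lost if the node fails, which matches the second bullet of the slide.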
  31. More: Real-Time Inverted Search (mockup: results for "MyCustomer" with the categories Sales, Juridic, Accounting and a "20 new results..." notification)
      • Document: 2010 Sales Report, 1 month ago. "... MyCustomer: 12 M€ with 3 deals ..."
      • Phone Call, 3 seconds ago. Customer: MyCustomer; Time: 9:55am; Duration: 13min; Description: Invoice not received for order #2354E
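
The slide does not detail how inverted search works. One common reading, assumed here, is that queries are registered up front and every incoming document is matched against them, which is what would drive a "20 new results..." counter. The `InvertedSearch` class below is a hypothetical sketch of that idea in plain Python.

```python
import re


def tokens(text):
    # Lowercase and split on non-alphanumerics; returns a set of terms.
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


class InvertedSearch:
    """Stores queries and matches each new document against them."""

    def __init__(self):
        self.queries = {}      # query id -> set of required terms
        self.new_results = {}  # query id -> number of new matching documents

    def register(self, query_id, query):
        self.queries[query_id] = tokens(query)
        self.new_results[query_id] = 0

    def on_document(self, text):
        # A query matches when all of its terms appear in the document.
        doc_terms = tokens(text)
        matched = sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms and terms <= doc_terms
        )
        for query_id in matched:
            self.new_results[query_id] += 1
        return matched


search = InvertedSearch()
search.register("q1", "MyCustomer")
search.register("q2", "invoice MyCustomer")
first_match = search.on_document("Phone Call. Customer: MyCustomer. Invoice not received")
second_match = search.on_document("2010 Sales Report. MyCustomer: 12 M€ with 3 deals")
```

Each open search page would then read its own counter from `search.new_results` to show how many new results have arrived since the query was run.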
  32. Summary
      • S4 is a nice processing system for real-time search, although events may be lost when nodes are overloaded or during failure
      • Not only for indexing time, also for query time! As S4 ensures low latency, query-time processing is possible
      • A promising roadmap: better failure handling, client API in major languages, initial processing with Hadoop, ...
  33. Questions / Answers. blog.xebia.fr, @mfiguiere