SlideShare a Scribd company logo
A Real-Time Search Engine with Lucene and S4
Yahoo! S4 applied to Information Retrieval




 2/5/2011                                    Michaël Figuière
Speaker

      @mfiguiere
      blog.xebia.fr



      Michaël Figuière           Distributed
                                 Architectures

                         NoSQL
Search Engines
Our case study




      A Search Engine to keep track of activities
                within an enterprise
The Problem
A Search Engine


                  Search
A Search Engine


            MyCustomer   Search
A Search Engine


                      MyCustomer                               Search




   Document     Non Disclosure Agreement                                          12 days ago
                   ... MyCustomer agrees not to disclose any part of ...



   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline



                    Tika


       PDF
                  Text
                            Analyzer
                Extractor
                                                Search
                                                 Index
                            Analyzer
      Phone
       Call



                                       Lucene
A more complex Search Engine


                      MyCustomer                               Search

                    Sales                   Juridic                   Accounting




   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline



            Tika       Mahout


 PDF
             Text
                       Classifier   Analyzer
           Extractor
                                                       Search
                                                        Index
                       Classifier   Analyzer
 Phone
  Call



                                              Lucene
More complex ...

• Entity Recognition
         Recognizes an entity written in any way



• Language Recognition
         To index each language separately



• Fetching linked URLs
         Enhances document context by also indexing linked URLs



• ...
A Real-Time Search Engine


                      MyCustomer                               Search

                    Sales                   Juridic                   Accounting




   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
A Real-Time Search Engine


                      MyCustomer                               Search

                    Sales                   Juridic                   Accounting




   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline


                                         Since Lucene 2.9



 PDF
          Text          Some
                                     Analyzer
        Extractor   Pre-Processing
                                                  Near Real-Time
                                                   Search Index
                        Some
                                     Analyzer
Phone               Pre-Processing
 Call
But...




 PDF
           Text          Some
                                         Analyzer
         Extractor   Pre-Processing
                                                    Near Real-Time
                                                     Search Index
                         Some
                                         Analyzer
Phone                Pre-Processing
 Call


                                      What if it takes
                                      one second/document
                                      on a single box ??
Let’s distribute it

                 Server 1                    Server 3

            Pre-            Search      Pre-            Search
         Processing          Index   Processing          Index



                 Server 2                    Server N

            Pre-            Search
         Processing          Index




  Processing logic and index structure distributed together
That’s a problem...

• Processing and index storage may have different scaling needs
        Depending on the search traffic, the processing overhead, ...



• Scaling up and down an index storage is long and complex
        Whereas stateless processing is simple to scale up/down



• Expensive pre-processing may make searches slower
        And indexing in real-time shouldn’t make searches slower !
Let’s move it to Hadoop




 PDF
          Text          Some
                                     Analyzer
        Extractor   Pre-Processing
                                                Near Real-Time
                                                 Search Index
                        Some
                                     Analyzer
Phone               Pre-Processing
 Call




                                     Hadoop MapReduce
But...

• Hadoop can only deal with chunk of data
         Data must be available somewhere on HDFS



• Unbounded stream of data can’t fit into Hadoop MapReduce
         Hadoop is thought and optimized for batch processing



• Manually bounding the stream won’t be efficient
         It’ll resulting in lot of regular and inefficient batches
S4
S4

• A distributed, fault-tolerant, stream processing system



• Elastic

            Based on Zookeeper



• Project started in november 2010, still experimental

            But things are moving fast !
Where does S4 come from ?

• Open Source project created by Yahoo!




• Initially built for relevant ad selection and clever positioning on webpages
         But thought to be generic enough



• Expensive pre-processing may make searches slower
         And indexing in real-time shouldn’t make searches slower !
Processing Element

                                  Your business
                                  logic goes here

                     Processing
                      Element



      Events Input                Events Output
Processing Node




                        Processing Node

           Processing     Processing      Processing
           Element 1      Element 2       Element N
S4 Cluster


                                      Cluster
                                      Management
             Processing Node 1


   Events
             Processing Node 2   Zookeeper
   Stream


             Processing Node N
Programming model


                                  PhoneCallPE

                               Accept events with :
                                 Type=PhoneCall
           Event               KeyTuple: Id=15497              Event
      Type: PhoneCall                                 Type: EnrichedPhoneCall

   KeyTuple: «Id=15497»                                KeyTuple: «Id=15497»

  Value: <serialized object>                          Value: <serialized object>


                                                A new Processing
                                                Element instance is created
                                                for each value of «Id»
An indexing pipeline with S4

               ReRoutingPE
                                                  Handles incoming events
                                                  and load-balance them
                                                  according to partitioning
 TextExtractionPE              TextExtractionPE




               ReRoutingPE




 ClassificationPE               ClassificationPE




                   MergingPE
An indexing pipeline with S4

               ReRoutingPE




 TextExtractionPE              TextExtractionPE


                                                  Handles result events
               ReRoutingPE                        and load-balance between
                                                  Processing Nodes

 ClassificationPE               ClassificationPE




                   MergingPE
An indexing pipeline with S4

               ReRoutingPE




 TextExtractionPE              TextExtractionPE




               ReRoutingPE




 ClassificationPE               ClassificationPE

                                                  Handles final result
                                                  events and push
                   MergingPE
                                                  them to the Indexer
Some drawbacks

• The system is lossy
         Events may be lost when nodes are overloaded or during failure



• A workaround is to increase the incoming queue of nodes
         But still, events may be lost during failure



• Still experimental
         But very promising
More: Real-Time Inverted Search


                      MyCustomer                               Search

                    Sales                   Juridic                   Accounting


                                     20 new results...


   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Summary

• S4 is a nice processing system for real-time search
         Events may be lost when nodes are overloaded or during failure



• Not only for indexing-time, also for query-time !
         As S4 ensures low latency, query-time processing is possible



• A promising roadmap....
         Better failure handling, client API in major languages,
         initial processing with Hadoop, ...
Questions / Answers




                       ?
                      blog.xebia.fr
                      @mfiguiere

More Related Content

Similar to FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartFishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartKim Negaard
 
Implementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessImplementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessDataWorks Summit
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use Cases
MongoDB
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
Inside Analysis
 
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidA Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
HostedbyConfluent
 
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Soroosh Khodami
 
IPexcel - Company Overview
IPexcel - Company OverviewIPexcel - Company Overview
IPexcel - Company Overview
IPexcel
 
How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?
Denodo
 
Microsoft StreamInsight
Microsoft StreamInsight Microsoft StreamInsight
Microsoft StreamInsight
Mark Ginnebaugh
 
ABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
C4Media
 
Digital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just TechnologyDigital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just Technology
confluent
 
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
InfluxData
 
Albel Pres Continuous Intelligence Overview
Albel Pres   Continuous Intelligence OverviewAlbel Pres   Continuous Intelligence Overview
Albel Pres Continuous Intelligence Overview
Ali BELCAID
 
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Denodo
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
Trieu Nguyen
 
Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...
sharedserviceslink.com
 
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Looker
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
Sergei Dolukhanov
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
Sergei Dolukhanov
 

Similar to FOSDEM (feb 2011) - A real-time search engine with Lucene and S4 (20)

Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick StartFishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
Fishbowl Solutions Mobile ECM Webinar Presentation: iPad Quick Start
 
Implementing Big Data at the Speed of Business
Implementing Big Data at the Speed of BusinessImplementing Big Data at the Speed of Business
Implementing Big Data at the Speed of Business
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use Cases
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and DruidA Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
A Trifecta of Real-Time Applications: Apache Kafka, Flink, and Druid
 
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
 
IPexcel - Company Overview
IPexcel - Company OverviewIPexcel - Company Overview
IPexcel - Company Overview
 
How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?How Does the Denodo Platform Accelerate Your Time to Insights?
How Does the Denodo Platform Accelerate Your Time to Insights?
 
Microsoft StreamInsight
Microsoft StreamInsight Microsoft StreamInsight
Microsoft StreamInsight
 
ABBYY USA TAWPI presentation
ABBYY USA TAWPI presentationABBYY USA TAWPI presentation
ABBYY USA TAWPI presentation
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Digital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just TechnologyDigital Transformation Mindset - More Than Just Technology
Digital Transformation Mindset - More Than Just Technology
 
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
Gilmore, Palani [InfluxData] | Use Case: Crypto & Fintech | InfluxDays 2022
 
Albel Pres Continuous Intelligence Overview
Albel Pres   Continuous Intelligence OverviewAlbel Pres   Continuous Intelligence Overview
Albel Pres Continuous Intelligence Overview
 
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
Wie beschleunigt die Denodo Plattform Ihre Zeit der Erkenntnisgewinnung?
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 
Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...Global automation domination: how do you roll out one workflow solution acros...
Global automation domination: how do you roll out one workflow solution acros...
 
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
Webinar with SnagAJob, HP Vertica and Looker - Data at the speed of busines s...
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
 

More from Michaël Figuière

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache Cassandra
Michaël Figuière
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersMichaël Figuière
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersMichaël Figuière
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersMichaël Figuière
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Michaël Figuière
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthMichaël Figuière
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversMichaël Figuière
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLMichaël Figuière
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutMichaël Figuière
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...Michaël Figuière
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraMichaël Figuière
 
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMichaël Figuière
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesMichaël Figuière
 
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentXebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentMichaël Figuière
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Michaël Figuière
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 

More from Michaël Figuière (20)

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache Cassandra
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for Developers
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java Developers
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java Developers
 
ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in Depth
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra Drivers
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQL
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache Mahout
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
 
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache MahoutMix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
Mix-IT (Apr 2011) - Intelligence Collective avec Apache Mahout
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
 
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web DevelopmentXebia Knowledge Exchange (feb 2011) - Large Scale Web Development
Xebia Knowledge Exchange (feb 2011) - Large Scale Web Development
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

  • 1. A Real-Time Search Engine with Lucene and S4 Yahoo! S4 applied to Information Retrieval 2/5/2011 Michaël Figuière
  • 2. Speaker @mfiguiere blog.xebia.fr Michaël Figuière Distributed Architectures NoSQL Search Engines
  • 3. Our case study A Search Engine to keep track of activities within an enterprise
  • 6. A Search Engine MyCustomer Search
  • 7. A Search Engine MyCustomer Search Document Non Disclosure Agreement 12 days ago ... MyCustomer agrees not to disclose any part of ... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 8. Indexing Pipeline Tika PDF Text Analyzer Extractor Search Index Analyzer Phone Call Lucene
  • 9. A more complex Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 10. Indexing Pipeline Tika Mahout PDF Text Classifier Analyzer Extractor Search Index Classifier Analyzer Phone Call Lucene
  • 11. More complex ... • Entity Recognition Recognizes an entity written in any way • Language Recognition To index each language separately • Fetching linked URLs Enhances document context by also indexing linked URLs • ...
  • 12. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 13. A Real-Time Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 14. Indexing Pipeline Since Lucene 2.9 PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some Analyzer Phone Pre-Processing Call
  • 15. But... PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some Analyzer Phone Pre-Processing Call What if it takes one second/document on a single box ??
  • 16. Let’s distribute it Server 1 Server 3 Pre- Search Pre- Search Processing Index Processing Index Server 2 Server N Pre- Search Processing Index Processing logic and index structure distributed together
  • 17. That’s a problem... • Processing and index storage may have different scaling needs Depending on the search traffic, the processing overhead, ... • Scaling up and down an index storage is long and complex Whereas stateless processing is simple to scale up/down • Expensive pre-processing may make searches slower And indexing in real-time shouldn’t make searches slower !
  • 18. Let’s move it to Hadoop PDF Text Some Analyzer Extractor Pre-Processing Near Real-Time Search Index Some Analyzer Phone Pre-Processing Call Hadoop MapReduce
  • 19. But... • Hadoop can only deal with chunk of data Data must be available somewhere on HDFS • Unbounded stream of data can’t fit into Hadoop MapReduce Hadoop is thought and optimized for batch processing • Manually bounding the stream won’t be efficient It’ll resulting in lot of regular and inefficient batches
  • 20. S4
  • 21. S4 • A distributed, fault-tolerant, stream processing system • Elastic Based on Zookeeper • Project started in november 2010, still experimental But things are moving fast !
  • 22. Where does S4 come from ? • Open Source project created by Yahoo! • Initially built for relevant ad selection and clever positioning on webpages But thought to be generic enough • Expensive pre-processing may make searches slower And indexing in real-time shouldn’t make searches slower !
  • 23. Processing Element Your business logic goes here Processing Element Events Input Events Output
  • 24. Processing Node Processing Node Processing Processing Processing Element 1 Element 2 Element N
  • 25. S4 Cluster Cluster Management Processing Node 1 Events Processing Node 2 Zookeeper Stream Processing Node N
  • 26. Programming model PhoneCallPE Accept events with : Type=PhoneCall Event KeyTuple: Id=15497 Event Type: PhoneCall Type: EnrichedPhoneCall KeyTuple: «Id=15497» KeyTuple: «Id=15497» Value: <serialized object> Value: <serialized object> A new Processing Element instance is created for each value of «Id»
  • 27. An indexing pipeline with S4 ReRoutingPE Handles incoming events and load-balance them according to partitioning TextExtractionPE TextExtractionPE ReRoutingPE ClassificationPE ClassificationPE MergingPE
  • 28. An indexing pipeline with S4 ReRoutingPE TextExtractionPE TextExtractionPE Handles result events ReRoutingPE and load-balance between Processing Nodes ClassificationPE ClassificationPE MergingPE
  • 29. An indexing pipeline with S4 ReRoutingPE TextExtractionPE TextExtractionPE ReRoutingPE ClassificationPE ClassificationPE Handles final result events and push MergingPE them to the Indexer
  • 30. Some drawbacks • The system is lossy Events may be lost when nodes are overloaded or during failure • A workaround is to increase the incoming queue of nodes But still, events may be lost during failure • Still experimental But very promising
  • 31. More: Real-Time Inverted Search MyCustomer Search Sales Juridic Accounting 20 new results... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 3 seconds ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 32. Summary • S4 is a nice processing system for real-time search Events may be lost when nodes are overloaded or during failure • Not only for indexing-time, also for query-time ! As S4 ensures low latency, query-time processing is possible • A promising roadmap.... Better failure handling, client API in major languages, initial processing with Hadoop, ...
  • 33. Questions / Answers ? blog.xebia.fr @mfiguiere