Lucene Revolution 2011



  1. Solr cluster installation tool "Anuenue" and "Did You Mean?" for Japanese
     Takahiko Ito, mixi, Inc.
  2. mixi?
     • One of the largest social networking services in Japan.
     • Many services to promote communication among users.
       ◦ Blog, news, game platform, etc.
       ◦ Most of the services come with search.
     • 15M monthly active users.
  3. Our current (urgent) project
     Replace in-house search engines with an up-to-date search platform!
     We have
       ◦ selected Apache Solr as the search platform;
       ◦ created a simple OSS package (Anuenue) which wraps Solr.
     Project URL:
  4. Reason why we made Anuenue
     Deployment and daily operation of a Solr search cluster are a bit difficult for ordinary engineers.
       ◦ We need to edit the configuration files of every Solr instance individually.
       ◦ Commands for whole clusters are not provided.
         - We need to write client commands ourselves.
         - Hadoop provides utility commands for clusters, e.g., (start processes), fsck (check all disks), balancer (rebalance the data blocks).
  5. What does Anuenue provide?
     • Handy configuration of search clusters
     • Commands for clusters
       ◦ Simple commands (post, delete, update, commit, etc.)
       ◦ Start and stop commands for the processes in a cluster
     • Japanese support
       ◦ Implementation of Japanese Did-You-Mean facilities
       ◦ Japanese tokenizers (Sen and Kuromoji)
  6. Today's topics
     • Anuenue
       ◦ Handy configuration of search clusters
       ◦ Commands for search clusters
     • Did-You-Mean facilities for Japanese queries
       ◦ Common problem in Did-You-Mean implementations
       ◦ Mining a Japanese Did-You-Mean dictionary from query log data
  7. Cluster configuration with Anuenue
     • Cluster setup is done with a special configuration file.
     • Anuenue assigns one or more roles to each instance.
       ◦ Roles are the functions in a cluster.
       ◦ Anuenue supports three roles (Master, Slave, Merger).
  8. Role: master
     • Indexes the input data.
     NOTE: Anuenue provides a command that distributes the input data across the master instances (building Solr shard indexes).
     [Figure: Input Data is split across Master-1, Master-2 and Master-3, which build the shard indexes.]
  9. Role: slave
     A slave has three functions:
       ◦ Copy (replicate) the index from a master.
       ◦ Accept queries from mergers and search its own index.
       ◦ Return the results to the merger instance.
     [Figure: Merger-1 submits queries to Slave-1 and Slave-2; the slaves replicate the indexes of Master-1 and Master-2, which index the input data.]
  10. Role: merger
      • Forwards queries from clients to slaves.
        ◦ Note: clients do not need to know the slave instances (the merger adds Solr's 'shards' parameter, which lists the slave instances).
      • Merges the results from all the slave instances and returns the merged results.
      [Figure: Client-1 and Client-2 submit queries to the Merger, which forwards them to Slave-1 and Slave-2.]
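
The merger's forwarding step can be sketched roughly as follows. Solr's distributed-search parameter is named `shards` and takes a comma-separated list of `host:port/path` entries. Everything else here is an assumption made for illustration: the function names, the `/solr` and `/solr/select` paths (Solr defaults, not confirmed for Anuenue), and the host names, which are borrowed from the example cluster in the slides.

```python
from urllib.parse import urlencode

def shards_param(slaves, path="solr"):
    """Join slave instances into the value of Solr's `shards` parameter."""
    return ",".join(f"{host}:{port}/{path}" for host, port in slaves)

def build_distributed_query(merger, slaves, query):
    """Build the select URL a merger could use to fan a query out to slaves.

    Hypothetical helper: Anuenue's real merger logic is not shown in the slides.
    """
    host, port = merger
    params = urlencode({"q": query, "shards": shards_param(slaves)})
    return f"http://{host}:{port}/solr/select?{params}"

# Hosts taken from the five-machine example: merger aa, slaves dd and ee.
SLAVES = [("dd", 8983), ("ee", 8983)]
url = build_distributed_query(("aa", 8983), SLAVES, "python")
print(url)
```

The client only ever sees the merger's address; the slave list stays a server-side detail, which is the point the slide makes.
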
  11. Example: Anuenue cluster
      The cluster consists of five machines.
        ◦ Each machine has one Anuenue instance.
      Instances:
        ◦ Merger: aa
        ◦ Master: bb, cc
        ◦ Slave: dd, ee
      [Figure: Client-1 and Client-2 submit queries to aa, which forwards them to the slaves; the slaves replicate the index from the masters, which index the input data.]
  12. How to assign roles to instances?
      Edit the cluster configuration file, anuenue-nodes.xml.
        - Add three elements (mergers, slaves and masters).
        - In each element, add the information (machine name and port number) for one or more instances.
  13. Configuration example
      Case: there is one merger instance on machine aa (port 7000).
      <mergers>
        <merger>
          <host>aa</host>
          <port>7000</port>
        </merger>
      </mergers>
  14. Specify the index to replicate
      <masters>
        <master iname="master1">
          <host>aaaa</host>
          <port>8983</port>
        </master>
      </masters>
      <slaves>
        <slave>
          <host>bbbb</host>
          <port>8983</port>
          <replicate>master1</replicate>
        </slave>
      </slaves>
      • Name the master instance with the iname attribute.
      • In the slave, add a replicate element naming the master instance whose index is copied.
  15. Example: simple cluster settings
      <mergers>
        <merger>
          <host>aa</host>
          <port>8983</port>
        </merger>
      </mergers>
      <masters>
        <master iname="master1">
          <host>bb</host>
          <port>8983</port>
        </master>
      </masters>
      <slaves>
        <slave>
          <host>cc</host>
          <port>8983</port>
          <replicate>master1</replicate>
        </slave>
      </slaves>
      [Figure: Client-1 and Client-2 submit queries to aa; aa forwards queries to cc; cc replicates the index from bb; bb indexes the input data.]
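
To make the role assignment concrete, here is a small sketch that reads the three role elements and checks that every slave's replicate target names a declared master. It is not Anuenue's own loader. The slides show only the `mergers`, `masters` and `slaves` fragments, so the enclosing `<nodes>` root element is a placeholder invented for this example, as are the function names.

```python
import xml.etree.ElementTree as ET

# The <nodes> wrapper is hypothetical; the inner elements follow the slides.
CONFIG = """
<nodes>
  <mergers>
    <merger><host>aa</host><port>8983</port></merger>
  </mergers>
  <masters>
    <master iname="master1"><host>bb</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""

def read_roles(xml_text):
    """Return {"mergers": [...], "masters": [...], "slaves": [...]}."""
    root = ET.fromstring(xml_text)
    roles = {}
    for group, tag in (("mergers", "merger"), ("masters", "master"), ("slaves", "slave")):
        roles[group] = [
            {
                "host": node.findtext("host"),
                "port": int(node.findtext("port")),
                "iname": node.get("iname"),              # only set on masters
                "replicate": node.findtext("replicate"),  # only set on slaves
            }
            for node in root.findall(f"{group}/{tag}")
        ]
    return roles

def dangling_slaves(roles):
    """Slaves whose <replicate> value does not match any master iname."""
    known = {master["iname"] for master in roles["masters"]}
    return [slave for slave in roles["slaves"] if slave["replicate"] not in known]

roles = read_roles(CONFIG)
print(roles["slaves"][0]["host"], "replicates", roles["slaves"][0]["replicate"])
```

A check like `dangling_slaves` catches the most likely editing mistake in this file: a replicate value that no longer matches any iname.
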
  16. Cluster setup with Anuenue
      • Flexible: supports various types of search cluster.
      • For example…
  17. Assign multiple roles
      [Figure: Client1 and Client2 submit queries to a single instance, which also indexes the input data.]
  18. Large clusters to handle huge data with high QPS
      [Figure: Client1 … ClientN query Merger1 to Merger3, which sit above Slave1 to Slave6; the slaves sit above Master1 to Master6, which index the input data.]
  19. After setting up a cluster
      We can make use of commands for clusters. Anuenue provides
        ◦ start / stop commands
        ◦ commands to manipulate the index
  20. Start and stop clusters
      Users can start or stop a cluster with a single command:
      $sh bin/ [start|stop]
  21. Simple commands for clusters
      Anuenue also provides basic commands ('post', 'delete', 'commit', 'optimize' and 'update') for search clusters.
        ◦ The commands are multi-threaded.
      E.g.,
      $sh bin/ post -arg inputDir
  22. Today's topics
      • Anuenue
        ◦ Handy configuration of search clusters
        ◦ Commands for search clusters
      • Did-You-Mean facilities for Japanese queries
        ◦ Common problem in Did-You-Mean implementations
        ◦ Mining a Japanese Did-You-Mean dictionary from query log data
  23. What is a Did-You-Mean service?
      • Suggests the correct spelling when users submit misspelled queries.
      • Increases the usability of the search service.
  24. Example: Did-You-Mean service (English: Ugly Betty)
  25. Common implementation
      Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965].
      Edit distance: a measure of the distance between two sequences, counted as the number of single-character insertions, deletions and substitutions needed to turn one into the other. Simply speaking, the more characters two sequences have in common, the smaller the distance.
      E.g.,
        like vs. likes (small distance)
        like vs. foobar (large distance)
  26. Common procedure: Did-You-Mean
      When a user submits a query,
        1. The Did-You-Mean service computes the edit distance between the input query and the words in the index.
        2. If there is a word whose distance is small,
           → the Did-You-Mean handler suggests that word.
      E.g., when a user submits the query "pthon", the Did-You-Mean service suggests "python", a word in the index with a small distance.
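
The two slides above can be sketched together: a textbook dynamic-programming Levenshtein distance, and a naive suggester that scans the index vocabulary. This is only an illustration of the idea. The linear scan, the `max_distance` threshold of 1 and the function names are assumptions; production spellcheckers avoid comparing the query against every indexed word.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]

def suggest(query, index_words, max_distance=1):
    """Return the closest index word within max_distance, or None."""
    if query in index_words:
        return None                         # spelled correctly, nothing to suggest
    best = min(index_words, key=lambda w: edit_distance(query, w), default=None)
    if best is not None and edit_distance(query, best) <= max_distance:
        return best
    return None

INDEX = ["java", "python", "tomcat"]
print(edit_distance("like", "likes"))   # 1
print(edit_distance("like", "foobar"))  # 6
print(suggest("pthon", INDEX))          # python
```
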
  27. Problem: Japanese queries
      Simple application of edit distance does not work for Japanese.
      → Misspelled queries are sometimes totally different from the correct one (large distance).
      E.g.,
        ◦ (correct: )
        ◦ (correct: )
      → These cases derive from the Japanese input method.
  28. Typing in a Japanese query
      We input Japanese (query) words in two steps.
        1. Type the reading of the Japanese word in the Latin alphabet.
        2. Select the desired word from the list of candidates.
      Step 2 can introduce a spelling mistake whose distance to the correct spelling is too large.
  29. Example: typing in Japanese queries
      Assume a user wants to submit a query: (Obama)
        1. Type in the reading in the Latin alphabet. Reading: obama
        2. Select the correct spelling. Possible candidates: (correct), , etc.
  30. Japanese Did-You-Mean dictionary
      • Because of the large-distance problem, simple distance measures (edit distance) do not work.
      • To handle this problem, Anuenue supports a special dictionary for the Japanese Did-You-Mean service.
  31. Dictionary for the Japanese Did-You-Mean service
      The dictionary has two columns:
        1. Query with mistakes
        2. Correct query
  32. Implementing the Did-You-Mean service with the dictionary
      When a user submits a query that appears in the "query with mistakes" column of the dictionary,
      → the Did-You-Mean service suggests the corresponding correct query.
      NOTE: Anuenue provides handlers for the dictionary format.
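
A dictionary-backed suggester reduces to an exact-match lookup, as in the sketch below. The slides do not give the on-disk format, so the tab-separated layout is an assumption, and the sample entries are illustrative picks rather than rows from the real dictionary (the Japanese pair uses two words that are both read "obama").

```python
# Assumed layout: one "query with mistakes<TAB>correct query" pair per line.
SAMPLE_LINES = [
    "pthon\tpython\n",
    "小浜\tオバマ\n",
]

def load_dictionary(lines):
    """Build a {misspelled query: correct query} table from tab-separated lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue                        # skip blank or malformed lines
        wrong, correct = line.split("\t", 1)
        table[wrong] = correct
    return table

def did_you_mean(query, table):
    """Return the stored correction for an exact match, else None."""
    return table.get(query)

table = load_dictionary(SAMPLE_LINES)
print(did_you_mean("小浜", table))
```

Unlike the edit-distance approach, the lookup never compares characters, so it works even when the misspelling and the correction share none.
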
  33. Problem…
      How can we create the dictionary?
      → We can make use of Oluolu, a query log mining tool.
  34. Oluolu
      • Creates a spelling correction dictionary from a query log.
      • Extracts pairs of queries (query with spelling mistakes, query with correct spelling).
        ◦ Supports Japanese spelling mistakes (from version 0.2).
      • Runs on the Hadoop framework.
      Project URL:
  35. Input to Oluolu: query log
      Three columns:
        1. User Id
        2. Query string
        3. Time of query submission
      User Id   Query         Time
      438904    Pthon         2009-11-21 11:16:12
      34443     Java          2009-11-21 12:16:13
      438904    Python        2009-11-21 12:16:20
      8975      Java Tomcat   2009-11-21 12:16:25
  36. Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu
      Oluolu extracts the elements of the Japanese Did-You-Mean dictionary in two steps.
        1. Extract all the query pairs in the same session.
        2. Validate the query pairs.
  37. Step 1: extract query pairs
      • Oluolu extracts pairs of queries in the same session. E.g., Oluolu extracts the pair (Pthon, Python).
      • Queries in the same session: a set of queries submitted by the same user within a small time range.
      • An extracted pair may consist of a misspelled query and its correct query.
      User ID   Query    Time
      438904    Pthon    2009-11-21 12:16:12
      34443     Java     2009-11-21 12:16:13
      438904    Python   2009-11-21 12:16:20
      8975      Tomcat   2009-11-21 12:16:25
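
Step 1 can be sketched in a few lines of single-machine Python. Oluolu itself runs on Hadoop, so this is not its implementation. The 60-second session window and the function name are assumptions; the slides only say "small time range". The log rows are the ones shown on this slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations

LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]

def extract_pairs(log, window=timedelta(seconds=60)):
    """Return (earlier query, later query) pairs issued by one user within `window`."""
    by_user = defaultdict(list)
    for user_id, query, timestamp in log:
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_user[user_id].append((when, query))

    pairs = []
    for entries in by_user.values():
        entries.sort()                      # chronological order per user
        for (t1, q1), (t2, q2) in combinations(entries, 2):
            if q1 != q2 and t2 - t1 <= window:
                pairs.append((q1, q2))
    return pairs

print(extract_pairs(LOG))  # [('Pthon', 'Python')]
```

The pairs produced here are only candidates: a user may simply have changed topic within the window, which is why the second step validates them.
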
  38. Step 2: validate candidate pairs
      • Oluolu validates all the query pairs extracted in step 1.
      • In the validation phase (step 2), Oluolu makes use of the readings of the queries.
  39. Readings of Japanese words
      • Japanese words can be converted into their readings in the Latin alphabet.
        ◦ (reading: konnichiha)
        ◦ (reading: itou)
      FACT: even when a misspelled Japanese query is totally different from the correct query,
      → the readings are the same or the distance between them is small!
  40. Validate a candidate pair with readings
      Given a query pair, Oluolu validates the queries in two steps.
        1. Convert the queries into readings in the Latin alphabet.
        2. Compute the edit distance between the two readings.
           → When the distance is small, the two queries are extracted as an element of the Did-You-Mean dictionary.
  41. Example: step 2
      Given a pair of queries: ( , )
        1. Convert them into readings.
           → The readings are the same, "sumitomofudousan".
        2. Compute the distance between the readings.
           → The distance is zero.
           → The pair is extracted as an element of the Did-You-Mean dictionary.
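
The validation step can be sketched as below. Converting Japanese text to a reading needs a morphological analyzer, which is out of scope here, so `READINGS` is a toy lookup table with hand-picked entries (two words that are both read "obama"). The table, the fallback to lowercasing, the threshold of 1 and all names are assumptions for illustration, not Oluolu's code. The edit distance helper is repeated so the block runs on its own.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Toy stand-in for a real reading converter (hypothetical entries).
READINGS = {"オバマ": "obama", "小浜": "obama"}

def to_reading(query):
    """Look up the Latin-alphabet reading; fall back to the lowercased query."""
    return READINGS.get(query, query.lower())

def is_spelling_pair(q1, q2, max_distance=1):
    """Accept a candidate pair when the readings are identical or nearly so."""
    if q1 == q2:
        return False
    return edit_distance(to_reading(q1), to_reading(q2)) <= max_distance

print(edit_distance("小浜", "オバマ"))     # 3: the surface forms share no characters
print(is_spelling_pair("小浜", "オバマ"))  # True: both readings are "obama"
```

Comparing readings instead of surface strings is what lets a pair with a large raw edit distance still be recognized as a misspelling and its correction.
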
  42. Creating a Japanese Did-You-Mean dictionary with Oluolu
      • Installation requirements
        ◦ Java 1.6.0 or greater
        ◦ Hadoop 0.20.0 or greater
        ◦ Oluolu 0.2.0 or greater
      • Copy the input query log into HDFS.
      • Run the spellcheck task of Oluolu:
      $ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja
  43. Preliminary experiments
      • Experimental settings
        ◦ Input data: log file from a mixi service (community search).
          - 5 GB of data
      • Extracted dictionary
        ◦ The number of elements is over 100,000.
        ◦ Oluolu succeeded in extracting query pairs with a large edit distance.
          - ( , )
          - ( , )
  44. Current status
      • Functional tests and stress tests are finished.
      • We are now replacing an in-house search engine in a small search service with Anuenue.
      • In the next phase, we will apply Anuenue to a search service with large data and high QPS.
  45. Future work
      • Integrate SolrCloud and ZooKeeper.
        ◦ Support failover and index rebalancing.
      • Kuromoji, a new OSS Japanese tokenizer.
  46. Summary
      • Introduced Anuenue.
      • Described a Did-You-Mean facility for Japanese queries.
  47. Thank you for your attention!