Implementation of "Did you mean" Facility for Queries in Japanese - By Takahiko Ito

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

mixi is one of the largest social networking services in Japan, providing various communication
services for over 14M monthly active users. The latest internal mixi project is to replace the in-house
search engine with Apache Solr. This session covers two topics:
a simple packaging system for Solr that eases the installation process and daily operations, and
an implementation of a "Did you mean" facility for Japanese queries using a log mining tool. These
tools have been released as OSS projects.

Published in: Technology

Transcript

  • 1. Solr Cluster installation tool "Anuenue" and "Did You Mean?" for Japanese Takahiko Ito mixi, Inc.
  • 2. mixi?
    • One of the largest social networking services in Japan.
    • Many services to promote communication among users.
      • Blog, news, game platform etc
      • Most of the services come with search
    • 15M monthly active users
  • 3. Our current (urgent) project …
    • Replace the in-house search engines with an up-to-date search platform!
    • We have
      • selected Apache Solr as the search platform!
      • created a simple OSS package ( Anuenue ) which wraps Solr
    • Project URL: http://code.google.com/p/anuenue-wrapper/
  • 4. Why we made Anuenue
    • Deployment and daily operation of a Solr search cluster are somewhat difficult for ordinary engineers.
      • The configuration files must be edited separately for every Solr instance
      • Commands that act on a whole cluster are not provided
        • We need to write client commands ourselves
        • By contrast, Hadoop provides utility commands for clusters
        • E.g., start-all.sh (start processes), fsck (check all disks), balancer (rebalance the data blocks)
  • 5. What does Anuenue provide?
    • Handy configuration of search clusters
    • Commands for clusters
      • Simple commands (post, delete, update, commit etc)
      • Start and stop commands for processes in cluster.
    • Japanese support
      • Implementation of Japanese Did-You-Mean facilities
      • Japanese tokenizer (Sen and Kuromoji)
  • 6. Today’s Topics
    • Anuenue
      • Handy configuration of search clusters
      • Commands for search clusters
    • Did-You-Mean facilities for Japanese queries
      • Common problem in Did-You-Mean implementation
      • Mining a Japanese Did-You-Mean dictionary from query log data
  • 7. Cluster configuration with Anuenue
    • Cluster setup is done with a special configuration file
    • Anuenue assigns one or more roles to each instance.
      • Roles are the functions in a cluster
      • Anuenue supports three roles ( Master, Slave, Merger )
  • 8. Role: master
    • Index input data.
    • NOTE: Anuenue provides a command to distribute the input data to the master instances (build Solr shard indexes).
    [Diagram: the input data is split across Master-1, Master-2 and Master-3 to build shard indexes]
  • 9. Role: slave
    • Has three functions
      • Copies (replicates) the index from a master
      • Accepts queries from mergers and searches its own index
      • Returns the results to the merger instance
    [Diagram: Master-1 and Master-2 index the input data; Slave-1 and Slave-2 replicate the index; Merger-1 submits queries to the slaves]
  • 10. Role: merger
    • Forwards queries from clients to slaves.
      • Note: clients do not need to know the slave instances (the merger adds the 'shards' parameter listing the slave instances)
    • Merges the results from all the slave instances and returns the merged results.
    [Diagram: Client-1 and Client-2 submit queries to the merger, which forwards them to Slave-1 and Slave-2]
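The forwarding step described above can be sketched in a few lines. This is only an illustration of how a Solr distributed request lists its shards, not Anuenue's actual code; the host names, the port and the "solr" path segment are assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_distributed_query(merger, slaves, query, path="solr"):
    """Build the URL a merger could use to fan a query out to its slaves.

    merger and each slave are (host, port) tuples. The "solr" path segment
    is an assumption about the deployment, not something the slides state.
    """
    # Solr's distributed search takes a comma-separated list of shard addresses.
    shards = ",".join(f"{host}:{port}/{path}" for host, port in slaves)
    host, port = merger
    return f"http://{host}:{port}/{path}/select?" + urlencode({"q": query, "shards": shards})

# Hypothetical cluster: one merger (aa) in front of two slaves (dd, ee).
url = build_distributed_query(("aa", 8983), [("dd", 8983), ("ee", 8983)], "python")
print(url)
print(parse_qs(urlsplit(url).query)["shards"][0])
```

The client only ever sees the merger's address; the shard list is added on the server side.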
  • 11. Example: Anuenue cluster
    • The cluster consists of five machines
      • Each has one Anuenue instance
    • Instances
      • Merger: aa
      • Master: bb, cc
      • Slave: dd, ee
    [Diagram: Client-1 and Client-2 query merger aa; aa forwards queries to slaves dd and ee; masters bb and cc index the input data; the slaves replicate the index]
  • 12. How to assign roles to instances?
    • Edit the cluster configuration file, anuenue-nodes.xml.
        • Add three elements (mergers, slaves and masters)
        • In each element, add one or more instance entries (machine name and port number).
  • 13. Configuration example
    • Case: there is one merger instance on machine aa (port 7000)
        <mergers>
          <merger>
            <host>aa</host>
            <port>7000</port>
          </merger>
        </mergers>
  • 14. Specify the index to replicate
        <masters>
          <master iname="master1">
            <host>aaaa</host>
            <port>8983</port>
          </master>
        </masters>
        <slaves>
          <slave>
            <host>bbbb</host>
            <port>8983</port>
            <replicate>master1</replicate>
          </slave>
        </slaves>
    • Name each master instance with the iname attribute.
    • Add a replicate element to a slave to specify the master instance whose index it copies.
  • 15. Example: simple cluster settings
        <mergers>
          <merger>
            <host>aa</host>
            <port>8983</port>
          </merger>
        </mergers>
        <masters>
          <master iname="master1">
            <host>bb</host>
            <port>8983</port>
          </master>
        </masters>
        <slaves>
          <slave>
            <host>cc</host>
            <port>8983</port>
            <replicate>master1</replicate>
          </slave>
        </slaves>
    [Diagram: Client-1 and Client-2 submit queries to merger aa, which forwards them to slave cc; master bb indexes the input data; cc replicates the index from bb]
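As a rough illustration of how such a configuration can be read and sanity-checked, the sketch below parses the three elements with Python's standard library. The enclosing `<nodes>` root element is an assumption (the slides show only the child elements), and `load_roles` and `unknown_masters` are hypothetical helpers, not part of Anuenue.

```python
import xml.etree.ElementTree as ET

# The <nodes> root is assumed; the slides only show mergers, masters and slaves.
CONFIG = """
<nodes>
  <mergers>
    <merger><host>aa</host><port>8983</port></merger>
  </mergers>
  <masters>
    <master iname="master1"><host>bb</host><port>8983</port></master>
  </masters>
  <slaves>
    <slave><host>cc</host><port>8983</port><replicate>master1</replicate></slave>
  </slaves>
</nodes>
"""

def load_roles(xml_text):
    """Return {"mergers": [...], "masters": [...], "slaves": [...]}."""
    root = ET.fromstring(xml_text.strip())
    roles = {}
    for group, tag in (("mergers", "merger"), ("masters", "master"), ("slaves", "slave")):
        roles[group] = [
            {
                "host": node.findtext("host").strip(),
                "port": int(node.findtext("port")),
                "iname": node.get("iname"),              # set on masters only
                "replicate": node.findtext("replicate"),  # set on slaves only
            }
            for node in root.findall(f"{group}/{tag}")
        ]
    return roles

def unknown_masters(roles):
    """List slaves whose replicate element names no configured master."""
    names = {m["iname"] for m in roles["masters"]}
    return [s["host"] for s in roles["slaves"] if s["replicate"] not in names]

roles = load_roles(CONFIG)
print(roles["slaves"])
print(unknown_masters(roles))
```

A check like `unknown_masters` catches the most likely editing mistake in this file: a slave that points at an iname no master declares.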
  • 16. Cluster setup with Anuenue
    • Flexible: supports various types of search clusters.
    • For example…
  • 17. Assign multiple roles
    [Diagram: a single instance indexes the input data and answers queries submitted by Client1 and Client2]
  • 18. Large clusters to handle huge data with high QPS
    [Diagram: Client1 to ClientN query Merger1 to Merger3; Slave1 to Slave6 serve the queries; Master1 to Master6 index the input data]
  • 19. After setting up cluster
    • We can make use of commands for clusters.
    • Anuenue provides
      • start / stop commands
      • commands to manipulate the index
  • 20. Start and stop clusters
    • Users can start or stop a cluster with a single command (anuenue-distdaemon.sh).
    • Usage:
    • $ sh bin/anuenue-distdaemon.sh [start|stop]
  • 21. Simple commands for clusters
    • Anuenue also provides basic commands ('post', 'delete', 'commit', 'optimize' and 'update') for search clusters
      • The commands are multi-threaded
    • E.g.,
    • $ sh bin/anuenue-distcommands.sh post -arg inputDir
  • 22. Today’s Topics
    • Anuenue
      • Handy configuration of search clusters
      • Commands for search clusters
    • Did-You-Mean facilities for Japanese queries
      • Common problem in Did-You-Mean implementation
      • Mining a Japanese Did-You-Mean dictionary from query log data
  • 23. What is a Did-You-Mean service?
    • Suggests the correct spelling when users submit misspelled queries
    • Increases the usability of the search service
  • 24. Example: Did-You-Mean service (English: Ugly Betty)
  • 25. Common implementation
    • Many search engines (including Solr) apply distance measures such as edit distance [Levenshtein, 1965]
    • Edit distance: a measure of the distance between two sequences, counted as the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other.
    • Simply speaking, the more characters two sequences have in common, the smaller the distance.
    • E.g.,
    • like → likes (small distance)
    • like → foobar (large distance)
  • 26. Common procedure: Did-You-Mean
    • When a user submits a query,
    • the Did-You-Mean service computes the edit distance between the input query and the words in the index.
    • If there is a word whose distance is small,
        • the Did-You-Mean handler suggests it.
    • E.g., when a user submits the query "pthon", the Did-You-Mean service suggests "python", a word in the index at a small distance.
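The procedure above can be sketched as a naive linear scan over the index vocabulary. Real search engines use faster candidate lookups; the `max_distance` threshold and the sample vocabulary are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def suggest(query, index_words, max_distance=2):
    """Return the closest index word, or None if the query is already in
    the index or nothing lies within max_distance (an assumed threshold)."""
    if not index_words or query in index_words:
        return None
    best = min(index_words, key=lambda word: edit_distance(query, word))
    return best if edit_distance(query, best) <= max_distance else None

vocabulary = ["python", "java", "tomcat"]  # hypothetical index contents
print(suggest("pthon", vocabulary))
```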
  • 27. Problem: Japanese queries
    • A simple application of edit distance does not work for Japanese
    • Misspelled queries are sometimes totally different from the correct one (large distance).
      • E.g.,
      • 墨ともふどうさん (correct: 住友不動産)
      • 米事案セット (correct: ベイジアンセット)
    • These cases are caused by the Japanese input method.
  • 28. Typing in Japanese query
    • We input Japanese (query) words in two steps.
      • Type the reading of the Japanese word in the Latin alphabet.
      • Select the desired word from the list of candidates
    • The second step is where spelling mistakes arise, and the wrong candidate can be at a large distance from the correct spelling.
  • 29. Example: Typing in Japanese queries
    • Assume a user wants to submit a query:
    • オバマ (Obama)
    • Type in the reading in Latin alphabet.
    • reading : obama
    • Select correct spelling.
    • Possible candidates: オバマ (correct), おばま, 小浜, etc.
  • 30. Japanese Did-You-Mean dictionary
    • Because of the large distance problem, simple distance measures (edit distance) do not work.
    • To handle this problem, Anuenue supports a special dictionary for Japanese Did-You-Mean service.
  • 31. Dictionary for Japanese Did-You-Mean service
    • The dictionary has two columns
      • Query with mistakes
      • Correct query
    Query with mistakes | Correct query
    墨ともふどうさん | 住友不動産
    歌だ光る | 宇多田ヒカル
    米事案セット | ベイジアンセット
  • 32. Implementing Did-You-Mean service with the dictionary
    • When a user submits a query that appears in the dictionary as a query with mistakes,
    • the Did-You-Mean service suggests the correct query
    • NOTE: Anuenue provides handlers for the dictionary format.
    Query with mistakes | Correct query
    墨ともふどうさん | 住友不動産
    歌だ光る | 宇多田ヒカル
    米事案セット | ベイジアンセット
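A dictionary-backed suggestion is essentially an exact-match lookup, as the following sketch shows. The tab-separated layout is an assumption for illustration; the slides do not specify the file format that Anuenue's handlers read.

```python
# Assumed layout: one "query with mistakes<TAB>correct query" pair per line.
DICTIONARY_TSV = (
    "墨ともふどうさん\t住友不動産\n"
    "歌だ光る\t宇多田ヒカル\n"
    "米事案セット\tベイジアンセット\n"
)

def load_dictionary(text):
    """Parse the two-column dictionary into a {misspelled: correct} mapping."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            wrong, right = line.split("\t")
            table[wrong.strip()] = right.strip()
    return table

def did_you_mean(query, table):
    """Return the correct query for a known mistake, else None."""
    return table.get(query.strip())

table = load_dictionary(DICTIONARY_TSV)
print(did_you_mean("歌だ光る", table))
```

Unlike the edit-distance scan, the lookup cost does not depend on how different the two spellings look, which is what makes it suitable for the Japanese case.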
  • 33. Problem…
    • How can we create the dictionary?
    • We can make use of a query log mining tool Oluolu .
  • 34. Oluolu
    • Creates a spelling correction dictionary from a query log
    • Extracts pairs of queries (query with spelling mistakes, query with correct spelling)
      • Supports Japanese spelling mistakes (since version 0.2)
    • Runs on the Hadoop framework
    • Project URL: http://code.google.com/p/oluolu/
  • 35. Input to Oluolu: query log
    • Three columns
      • User Id
      • Query string
      • Time of query submission
    User Id | Query | Time
    438904 | Pthon | 2009-11-21 11:16:12
    34443 | Java | 2009-11-21 12:16:13
    438904 | Python | 2009-11-21 12:16:20
    8975 | Java Tomcat | 2009-11-21 12:16:25
  • 36. Procedure: creating a Japanese Did-You-Mean dictionary with Oluolu
    • Oluolu extracts the elements of the Japanese Did-You-Mean dictionary in two steps.
      • Extract all the query pairs in the same session
      • Validate the query pairs
  • 37. Step1: extract query pairs
    • Oluolu extracts pairs of queries in the same session.
    • E.g., Oluolu extracts the pair (Pthon, Python).
    • Queries in the same session: a set of queries submitted by the same user within a short time range.
    • An extracted pair may consist of a misspelled query and its correct query.
    User ID | Query | Time
    438904 | Pthon | 2009-11-21 12:16:12
    34443 | Java | 2009-11-21 12:16:13
    438904 | Python | 2009-11-21 12:16:20
    8975 | Tomcat | 2009-11-21 12:16:25
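Step 1 can be sketched on a single machine as follows. Oluolu itself runs on Hadoop, so this is only an illustration of the idea; the 60-second session window and the choice to pair consecutive queries are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# The query log from the slide: (user id, query, time of submission).
LOG = [
    ("438904", "Pthon", "2009-11-21 12:16:12"),
    ("34443", "Java", "2009-11-21 12:16:13"),
    ("438904", "Python", "2009-11-21 12:16:20"),
    ("8975", "Tomcat", "2009-11-21 12:16:25"),
]

def extract_session_pairs(log, window=timedelta(seconds=60)):
    """Pair consecutive queries from the same user that fall within `window`.

    The window length is an assumed value; the slides only say "small".
    """
    by_user = defaultdict(list)
    for user_id, query, stamp in log:
        by_user[user_id].append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), query))
    pairs = []
    for events in by_user.values():
        events.sort()  # chronological order per user
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            if q1 != q2 and t2 - t1 <= window:
                pairs.append((q1, q2))  # (earlier query, later query)
    return pairs

print(extract_session_pairs(LOG))
```

The output is only a list of candidates; pairs such as (Java, Python) from one user would also pass this step, which is why a validation step follows.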
  • 38. Step 2: validate candidate pairs
    • Oluolu validates all the query pairs extracted in step 1.
    • In validation phase (step 2), Oluolu makes use of query readings.
  • 39. Reading of Japanese words
    • Japanese words can be converted into readings in the Latin alphabet.
      • こんにちは (reading: konnichiha)
      • 伊藤 (reading: itou)
    • FACT: even when a misspelled Japanese query is totally different from the correct query,
      • the readings are the same or the distance between them is small!
  • 40. Validate candidate pair with reading
    • Given a query pair, Oluolu validates it in two steps
      • Convert the queries into readings in the Latin alphabet
      • Compute the edit distance between the two readings
        • When the distance is small, the pair is extracted as an element of the Did-You-Mean dictionary.
  • 41. Example: step 2
    • Given a pair of queries: ( 墨ともふどうさん , 住友不動産 )
    • Convert them into readings
      • readings are the same, “sumitomofudousan”.
    • Compute the distance with the readings
      • Distance is zero
      • Extracted as an element of the Did-You-Mean dictionary
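The validation step can be illustrated with a small sketch. Converting Japanese text to readings normally requires a morphological analyzer, so the `READINGS` table below is a hypothetical stand-in filled with readings quoted in the slides; the `max_distance` threshold is likewise an assumption.

```python
# Hypothetical stand-in for a reading converter; entries come from the slides.
READINGS = {
    "墨ともふどうさん": "sumitomofudousan",
    "住友不動産": "sumitomofudousan",
    "オバマ": "obama",
    "小浜": "obama",
    "伊藤": "itou",
}

def edit_distance(a, b):
    """Levenshtein distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_valid_pair(first, second, max_distance=1):
    """Accept a pair when the readings of both queries are close.

    max_distance is an assumed threshold; the slides only say "small".
    """
    r1, r2 = READINGS.get(first), READINGS.get(second)
    if r1 is None or r2 is None:
        return False  # no reading available, so the pair cannot be validated
    return edit_distance(r1, r2) <= max_distance

pair = ("墨ともふどうさん", "住友不動産")
print(edit_distance(*pair))  # distance between the written forms
print(is_valid_pair(*pair))  # decision based on the readings
```

The written forms share no characters, so their distance is as large as it can be, while the readings are identical and the pair is accepted.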
  • 42. Creating Japanese Did-You-Mean dictionary with Oluolu
    • Installation requirements
      • Java 1.6.0 or greater
      • Hadoop 0.20.0 or greater
      • Oluolu 0.2.0 or greater
    • Copy the input query log into HDFS
    • Run the spellcheck task of Oluolu
    • $ bin/oluolu spellcheck -input testInput.txt -output output -inputLanguage ja
  • 43. Preliminary experiments
    • Experimental settings
      • Input data: log file from a mixi service (community search).
        • 5 GB data
    • Extracted dictionary
      • The number of elements is over 100,000
      • Succeeded in extracting query pairs with a large edit distance.
        • ( 議 Ν, ギニュー )
        • ( 不動有利 , 不動裕理 )
  • 44. Current status
    • Finished functional tests and stress tests.
    • Now replacing an in-house search engine in a small search service with Anuenue.
    • In the next phase, we will apply Anuenue to search services with large data and high QPS.
  • 45. Future work
    • Integrate SolrCloud and ZooKeeper
      • Support failover, and rebalance the index
    • Kuromoji, a new OSS Japanese tokenizer
  • 46. Summary
    • Introduced Anuenue
    • Described a Did-You-Mean facility for Japanese queries
  • 47. Thank you for your attention!