Trying Out Lucandra

Trying Out Lucandra
2010/6/25 佐藤 史彦

Agenda
* What is Lucandra?
* How Lucandra is structured
* What it can do
* Trying it out
* Summary

A Cassandra-based Lucene backend
Author: Jake Luciani

Rather than adding an indexing feature to Cassandra, Lucandra uses Cassandra as the store for Lucene/Solr indexes, so that those indexes can be built in real time and scaled out easily.

[Diagram: Lucene. A Java application indexes Documents (made of Fields) through an Analyzer and IndexWriter into a Lucene Index on disk. Searches go through QueryParser, Analyzer, Query, IndexSearcher and IndexReader, and return Hits (Documents).]

[Diagram: Lucandra. The same Lucene indexing and search flow (Analyzer, IndexWriter, QueryParser, Query, IndexSearcher, IndexReader, Hits), with Cassandra replacing the disk as the store behind the Lucene Index.]

Index structure
Keyspace: Lucandra
  ColumnFamily: Documents
    Key: hash of the index name + document ID
      Column Name: field name
      Value: field value
  SuperColumnFamily: TermInfo
    Key: hash of (index name + field name) + field name + term
      SuperColumn: document ID
        Column Name: Frequencies   Value: number of occurrences of the term in the document
        Column Name: Norms         Value: norm of the document for the term
        Column Name: Offsets       Value: byte offsets of the term in the document
        Column Name: Position      Value: positions at which the term occurs in the document
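The TermInfo layout above can be pictured as nested sorted maps. The following is only an in-memory sketch of that shape, not Lucandra's code: the class and method names are made up for illustration, `String.hashCode()` stands in for whatever hash Lucandra really uses, and the "/" delimiter in the row key is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory model of the TermInfo layout; not Lucandra's implementation.
class TermInfoSketch {
    // row key -> (super column = document ID -> (column name -> value))
    final TreeMap<String, TreeMap<String, Map<String, int[]>>> termInfo = new TreeMap<>();

    // Hypothetical row key: hash of (index name + field name) + field name + term.
    // The hash function and the "/" delimiter are assumptions.
    static String rowKey(String indexName, String field, String term) {
        String hash = Integer.toHexString((indexName + field).hashCode());
        return hash + field + "/" + term;
    }

    // Records one document under a term, with its frequency and positions.
    void addPosting(String indexName, String field, String term, String docId, int[] positions) {
        Map<String, int[]> columns = new TreeMap<>();
        columns.put("Frequencies", new int[] { positions.length });
        columns.put("Position", positions);
        termInfo.computeIfAbsent(rowKey(indexName, field, term), k -> new TreeMap<>())
                .put(docId, columns);
    }

    // Document frequency = number of super columns stored under the term's row.
    int docFreq(String indexName, String field, String term) {
        TreeMap<String, Map<String, int[]>> docs = termInfo.get(rowKey(indexName, field, term));
        return docs == null ? 0 : docs.size();
    }
}
```

In this picture, looking up a term is a single row read, and the number of super columns in that row is the term's document frequency.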

From the README
1. Real-Time indexing (documents become available almost immediately)
2. No optimizing
3. Search
4. Sort
5. Range Queries
6. Delete
7. Wildcards and other Lucene magic
8. Faceting/Highlighting
Items 4, 5 and 7 are not possible with RandomPartitioner.
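Why the partitioner matters for wildcards and ranges: OrderPreservingPartitioner keeps rows in key order, so every term sharing a prefix sits in one contiguous key range, while RandomPartitioner places rows by a hash of the key and that contiguity is lost. A minimal sketch of a prefix scan over sorted keys follows; the class name and the "title/..." key format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: a prefix (wildcard) lookup as a range scan over sorted keys.
class PrefixScanSketch {
    static List<String> prefixScan(TreeSet<String> sortedKeys, String prefix) {
        List<String> hits = new ArrayList<>();
        // Start at the first key >= prefix and walk forward in key order.
        for (String key : sortedKeys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // past the contiguous block of matching keys
            }
            hits.add(key);
        }
        return hits;
    }
}
```

The early `break` is only valid because the keys are sorted. With hashed placement, the same query would have to examine every key.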

What is not possible at present
  You can't walk the documents with index reader.
What is slow at present
  Indexes with many documents and very dense terms.

Environment
Cassandra 0.6.2 (single node)

Build
Download the tarball from:
    http://github.com/tjake/Lucandra
Build lucandra.jar with ant.
Supported versions: Lucene 2.9.1, Cassandra 0.6
    $ tar xzf tjake-Lucandra-c632677.tar.gz
    $ cd tjake-Lucandra-c632677
    $ ant lucandra.jar

Replacing storage-conf.xml
Replace storage-conf.xml and start Cassandra.
Note: this assumes Cassandra holds no data yet.
    $ cp config/storage-conf.xml \
        /usr/local/cassandra/conf/
    $ /usr/local/cassandra/bin/cassandra

Key points of storage-conf.xml
    <Keyspace Name="Lucandra">
      <ColumnFamily CompareWith="BytesType"
                    Name="Documents"
                    KeysCached="10%" />
      <ColumnFamily ColumnType="Super"
                    CompareWith="BytesType"
                    CompareSubcolumnsWith="BytesType"
                    Name="TermInfo"
                    KeysCached="10%" />
        :
        :
    </Keyspace>
    <Partitioner>
      org.apache.cassandra.dht.OrderPreservingPartitioner
    </Partitioner>
On cluster nodes, InitialToken should also be set appropriately.

Trying the demo (BookmarksDemo)
[Diagram: Bookmarks Demo. With -index, a TSV file is turned into Documents with the fields url, title and tags, and written through SimpleAnalyzer and IndexWriter to the Lucene Index on Cassandra. With -search, the query goes through QueryParser, SimpleAnalyzer, Query, IndexSearcher and IndexReader, and returns Hits (Documents).]

Checking that it works
    $ ./run_demo.sh -index bookmark.tsv
    $ ./run_demo.sh -search title:linu*
    Search matched: 5 item(s)
    1. ZFS on FUSE/Linux
       http://zfs-on-fuse.blogspot.com/
    2. Set Up Postfix For Relaying Emails Through Another Mailserver | HowtoForge - Linux Howtos and Tutorials
       http://www.howtoforge.com/postfix_relaying_through_another_mailserver
    3. Debian GNU/Linux System Administration Resources
       http://www.debian-administration.org/
    4. Linux Scalability
       http://www.cs.wisc.edu/condor/condorg/linux_scalability.html
    5. LinuxDevCenter.com -- Cache-Friendly Web Pages
       http://www.linuxdevcenter.com/pub/a/linux/2002/02/28/cachefriendly.html

Now that the demo is confirmed to work, let's build a sample with Japanese text.

Sample data
Using a certain restaurant search API, I fetched 980 entries for this neighborhood and saved them as TSV.
    id       docID, INDEX, STORE
    name     INDEX(ANALYZED), STORE
    url      STORE
    address  INDEX(ANALYZED), STORE
    tel      INDEX(ANALYZED), STORE
    budget   INDEX, STORE
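As a rough illustration of how one TSV row maps onto the fields listed above, here is a small parsing sketch. The column order is an assumption (the slides list the fields but not the file layout), and `ShopTsvSketch` is a made-up name; the actual sample program is a modified BookmarksDemo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: map one TSV row to named fields. The column order is assumed.
class ShopTsvSketch {
    static final String[] FIELDS = { "id", "name", "url", "address", "tel", "budget" };

    static Map<String, String> parseLine(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "expected " + FIELDS.length + " columns but got " + cols.length);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            doc.put(FIELDS[i], cols[i]);
        }
        return doc;
    }
}
```

Each resulting map would then be turned into a Lucene Document, with the per-field index and store settings from the table above.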

Sample program
Copy BookmarksDemo and make the following changes:
* Change the Analyzer: SimpleAnalyzer -> CJKAnalyzer
* Change the index name: bookmarks -> shopsearch
* Change the document fields to match the sample data

Running the sample
    $ ./run_shop.sh -index data.tsv
    $ ./run_shop.sh -search name:丸の内
    Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
    12:08:03,436 INFO CassandraProxyClient:145 - Connected to cassandra at 127.0.0.1:9160
    name:"丸の の内"
    12:08:03,863 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name 丸の to in 95ms
    12:08:03,863 DEBUG LucandraTermEnum:246 - name 丸の has 115
    12:08:03,869 DEBUG LucandraTermEnum:246 - name 丸ノ has 10
    12:08:03,871 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name 丸の (3) took 103ms
    12:08:03,872 INFO IndexReader:153 - docFreq() took: 232ms
    12:08:03,872 INFO IndexReader:153 - docFreq() took: 232ms
    12:08:03,916 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name の内 to in 43ms
    12:08:03,916 DEBUG LucandraTermEnum:246 - name の内 has 115
    12:08:03,925 DEBUG LucandraTermEnum:246 - name の勘 has 2

Running the sample (continued)
    12:08:03,927 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name の内 (3) took 54ms
    12:08:03,927 INFO IndexReader:153 - docFreq() took: 54ms
    12:08:03,947 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name 丸の in cache
    12:08:03,953 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name の内 in cache
    Search matched: 0 item(s)
Huh? No hits... (even though everything looked fine up to that point)

Investigating the cause
With CJKAnalyzer, QueryParser.parse() returns a Query in which the CJK string has been split into bi-grams:
    name:丸の内
      ↓
    name:"丸の の内"
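The bi-gram split can be reproduced with a few lines of plain Java. This is only a sketch of the idea, not CJKAnalyzer itself: it treats every character as CJK, ignores surrogate pairs, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: overlapping two-character terms, in the style of a CJK bi-gram tokenizer.
class BigramSketch {
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        if (text.length() == 1) {
            terms.add(text); // a lone character is kept as a single term
            return terms;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }
}
```

For 丸の内 this yields the two terms 丸の and の内, which are the two terms that appear in the parsed query and in the LucandraTermEnum log lines.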

Investigating the cause
The resulting Query is an instance of PhraseQuery.
When a PhraseQuery is used, documents that should hit are not returned, possibly because LucandraTermDocs.nextPosition() does not work properly (?).

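Investigating the cause
That seems to be the problem, but I could not judge how far its effects reach without digging deeper, so I looked for a workaround...
Update: solved. See the appended material (the supplement) for details.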

Workaround
While looking into why bi-grams are treated as a PhraseQuery in the first place, I found the following:
    関口宏司の Lucene ブログ (Koji Sekiguchi's Lucene blog)
    http://lucene.jugem.jp/?cid=5

Workaround
According to the blog, there is a proposal that, from Lucene 3.1, a BooleanQuery should be generated instead of a PhraseQuery when the Analyzer produces multiple terms, and a patch is available.

Workaround
Force this patch onto 2.9.1:
    $ tar xzf lucene-2.9.1.tar.gz
    $ cd lucene-2.9.1
    $ curl -O https://issues.apache.org/jira/secure/attachment/12445136/LUCENE-2458.patch
    $ patch -b -p1 < LUCENE-2458.patch

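Workaround
The build fails as is, so fetch the following two files from the Lucene repository:
    org/apache/lucene/util/
        Version.java
        VirtualMethod.java
Note: these classes deal with method versioning and do not seem to affect the actual processing much (?).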

Workaround
The patched source uses Java 5 and later syntax, so change the javac options and build.
common-build.xml:
    <property name="javac.source" value="6"/>
    <property name="javac.target" value="6"/>

    <property name="javadoc.link" value="http://java.sun.com/javase/6/docs/api/"/>

    $ ant

Workaround
Replace Lucandra's lib/lucene-core-2.9.1.jar with build/lucene-core-2.9.1-dev.jar.
Set the QueryParser default operator to AND and try again!
ShopSearchDemo.java:
    QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
    qp.setDefaultOperator(QueryParser.Operator.AND);
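What the AND workaround buys can be sketched as a plain intersection of per-term document sets. This is an illustration with made-up names and data, not Lucene's BooleanQuery implementation.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: AND over terms = intersection of the document sets of each term.
class BooleanAndSketch {
    // postings: term -> IDs of the documents containing it (no positions involved)
    static Set<String> matchAll(Map<String, Set<String>> postings, List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index itself is not modified
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }
}
```

Because no positions are consulted, this works even when the index stores none. The trade-off is precision: a document that contains 丸の and の内 in unrelated places would also match, which a phrase query would reject.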

    $ ./run_shop.sh -search name:丸の内
    Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
    12:08:03,436 INFO CassandraProxyClient:145 - Connected to cassandra at 127.0.0.1:9160
    +name:丸の +name:の内
    18:03:39,127 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name 丸の to in 109ms
    18:03:39,127 DEBUG LucandraTermEnum:246 - name 丸の has 115
    18:03:39,128 DEBUG LucandraTermEnum:246 - name 丸ノ has 10
    18:03:39,130 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name 丸の (3) took 112ms
    18:03:39,131 INFO IndexReader:153 - docFreq() took: 222ms
    18:03:39,189 DEBUG LucandraTermEnum:237 - Found 2 keys in range:OxSo2Td8name の内 to in 57ms
    18:03:39,189 DEBUG LucandraTermEnum:246 - name の内 has 115
    18:03:39,190 DEBUG LucandraTermEnum:246 - name の勘 has 2
    18:03:39,190 DEBUG LucandraTermEnum:285 - loadTerms: OxSo2Td8name の内 (3) took 58ms
    18:03:39,190 INFO IndexReader:153 - docFreq() took: 59ms
    18:03:39,196 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name 丸の in cache
    18:03:39,202 DEBUG LucandraTermEnum:176 - Found OxSo2Td8name の内 in cache

    Search matched: 115 item(s)
    09:52:16,739 DEBUG IndexReader:293 - Document read took: 10ms
    1. Ｌｕｘｏｒ丸の内 http://r.gnavi.co.jp/g763393/ ¥9000
    09:52:16,741 DEBUG IndexReader:293 - Document read took: 1ms
    2. ｔｈｅＰａｎｔｒｙ丸の内店 http://r.gnavi.co.jp/g763381/ ¥1300
    09:52:16,743 DEBUG IndexReader:293 - Document read took: 1ms
    3. ＭＡＩＳＯＮ・ＢＡＲＳＡＣ丸の内 http://r.gnavi.co.jp/g763375/ ¥5500
    09:52:16,750 DEBUG IndexReader:293 - Document read took: 2ms
    4. 丸の内やんも http://r.gnavi.co.jp/g763373/ ¥8000
    09:52:16,751 DEBUG IndexReader:293 - Document read took: 1ms
    5. Ｖｉｎｐｉｃｏｅｕｒ～丸の内～ http://r.gnavi.co.jp/g763372/ ¥3500
    09:52:16,753 DEBUG IndexReader:293 - Document read took: 1ms
    6. ＤＥＡＮ＆ＤＥＬＵＣＡ～丸の内～ http://r.gnavi.co.jp/g763365/ ¥1500
    09:52:16,755 DEBUG IndexReader:293 - Document read took: 2ms
    7. Ｓ．Ｓｔｅｆａｎｏ～丸の内～ http://r.gnavi.co.jp/g763359/ ¥4500
     :
     :

Oh, it seems to be working.

Trying out sorting as well
When the sort option is given, have IndexSearcher.search() sort by the budget field in descending order, then run it.
ShopSearchDemo.java:
    TopDocs docs = indexSearcher.search(q, null, 10,
        new Sort(new SortField("budget", SortField.INT, true)));

    $ ./run_shop.sh -search name:丸の内 sort
    Search matched: 115 item(s)
    09:59:17,396 DEBUG IndexReader:293 - Document read took: 9ms
    1. レストランモナリザ丸の内店～丸ビル～ http://r.gnavi.co.jp/g763345/ ¥10000
    09:59:17,398 DEBUG IndexReader:293 - Document read took: 1ms
    2. センチュリーコート丸の内 http://r.gnavi.co.jp/g038917/ ¥10000
    09:59:17,399 DEBUG IndexReader:293 - Document read took: 1ms
    3. Ｌｕｘｏｒ丸の内 http://r.gnavi.co.jp/g763393/ ¥9000
    09:59:17,401 DEBUG IndexReader:293 - Document read took: 1ms
    4. 丸の内やんも http://r.gnavi.co.jp/g763373/ ¥8000
    09:59:17,402 DEBUG IndexReader:293 - Document read took: 1ms
    5. たまさか丸の内店 http://r.gnavi.co.jp/e533319/ ¥8000
    09:59:17,403 DEBUG IndexReader:293 - Document read took: 1ms
    6. ワインショップエノテカ丸の内ザ・ラウンジ http://r.gnavi.co.jp/g763382/ ¥6300
    09:59:17,405 DEBUG IndexReader:293 - Document read took: 1ms
    7. 寿し屋の勘八旬～丸の内～ http://r.gnavi.co.jp/g763366/ ¥6000
     :

Looks like it's sorted.

Findings 1
As advertised, Lucandra is Lucene with Cassandra as its backend, and applications can use existing Lucene assets (the API) almost unchanged.
# PhraseQuery needs further investigation

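Findings 2
To make full use of Lucene's features, OrderPreservingPartitioner must be selected.
The Partitioner is currently a cluster-wide setting, so the simple and effective data distribution of RandomPartitioner is lost. Introducing Lucandra into a shared cluster therefore needs careful consideration.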

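Findings 3
By exploiting Cassandra's internal characteristics, the design removes the need for index optimization and improves real-time behavior.
It seems well suited to cases such as Twitter clients or RSS readers, where each user has a separate index, the total data volume is large, and search must be available immediately.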

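Findings 4
Naturally, it can only be used from Java.
In other environments, you would presumably use the bundled Solandra over HTTP.
Even from Java, it may be better to go through SolrJ and take advantage of Solr's index management and caching.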

Future work
* Study Lucene/Solr a bit more?
* Evaluate practicality with Solandra as the base
* Evaluate data volume and performance
* Test behavior with RandomPartitioner
* PhraseQuery
  . . .

References
■ A Cassandra-based Lucene backend
  http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
■ slideshare - Lucandra
  http://www.slideshare.net/otisg/lucandra
■ Cassandra: RandomPartitioner vs OrderPreservingPartitioner
  http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
■ 関口宏司の Lucene ブログ (Koji Sekiguchi's Lucene blog)
  http://lucene.jugem.jp/

Trying Out Lucandra: Supplement
I looked into PhraseQuery...
2010/7/15 佐藤 史彦

Previously

But something still bothers me...

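So I traced through the source...
It turns out that PhraseQuery does not work unless TermInfo (the inverted index) contains Position (the positions at which a term occurs).
Note: to record Position, it must be requested at indexing time.
Note: stock Lucene works without it, though...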
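Why a phrase match depends on positions can be shown with a small sketch. It assumes that consecutive bi-grams get consecutive positions (0, 1, 2, ...), and it is not Lucene's or Lucandra's PhraseQuery code; the names are made up.

```java
import java.util.List;

// Sketch: a phrase matches when term i occurs at position (start + i) for some start.
class PhraseMatchSketch {
    // positions.get(i) = positions of the i-th phrase term within one document
    static boolean matchesPhrase(List<int[]> positions) {
        if (positions.isEmpty()) {
            return false;
        }
        for (int start : positions.get(0)) {
            boolean ok = true;
            for (int i = 1; i < positions.size() && ok; i++) {
                ok = contains(positions.get(i), start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false; // also reached when no positions were recorded at all
    }

    private static boolean contains(int[] values, int wanted) {
        for (int v : values) {
            if (v == wanted) {
                return true;
            }
        }
        return false;
    }
}
```

With empty position lists, no document can ever match, even though each term is present in the index. That mirrors the symptom seen earlier: both terms reported 115 documents, yet the search matched 0 items.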

Index structure (excerpt from the previous slides)
Keyspace: Lucandra
  ColumnFamily: Documents
    Key: hash of the index name + document ID
      Column Name: field name
      Value: field value
  SuperColumnFamily: TermInfo
    Key: hash of (index name + field name) + field name + term
      SuperColumn: document ID
        Column Name: Frequencies   Value: number of occurrences of the term in the document
        Column Name: Norms         Value: norm of the document for the term
        Column Name: Offsets       Value: byte offsets of the term in the document
        Column Name: Position      Value: positions at which the term occurs in the document   <- this one

So, what to do?
Specify Positions when building the index.
    doc.add(new Field("name", name, Store.YES, Index.ANALYZED, Field.TermVector.WITH_POSITIONS));

    <field name="name" type="text_cjk" indexed="true" stored="true" termPositions="true" />

Results (partly omitted)
    $ ./run_shop.sh -index data.tsv
    $ ./run_shop.sh -search name:丸の内
    name:"丸の の内"
    Search matched: 115 item(s)
    1. Ｌｕｘｏｒ丸の内 http://r.gnavi.co.jp/g763393/ ¥9000
    2. ｔｈｅＰａｎｔｒｙ丸の内店 http://r.gnavi.co.jp/g763381/ ¥1300
    3. ＭＡＩＳＯＮ・ＢＡＲＳＡＣ丸の内 http://r.gnavi.co.jp/g763375/ ¥5500
    4. 丸の内やんも http://r.gnavi.co.jp/g763373/ ¥8000
    5. Ｖｉｎｐｉｃｏｅｕｒ～丸の内～ http://r.gnavi.co.jp/g763372/ ¥3500
    6. ＤＥＡＮ＆ＤＥＬＵＣＡ～丸の内～ http://r.gnavi.co.jp/g763365/ ¥1500
    7. Ｓ．Ｓｔｅｆａｎｏ～丸の内～ http://r.gnavi.co.jp/g763359/ ¥4500

The search now works.
That's all.
