PostgreSQL Conference Japan 2017の資料です。
PostgreSQLが苦手な全文検索機能を強力に補完する拡張機能PGroongaの最新情報を紹介します。PGroongaは単なる全文検索モジュールではありません。PostgreSQLで簡単にリッチな全文検索システムを実現するための機能一式を提供します。PGroongaは最新PostgreSQLへの対応も積極的です。例としてロジカルレプリケーションと組み合わせた活用方法も紹介します。
PostgreSQL Conference Japan 2017の資料です。
PostgreSQLが苦手な全文検索機能を強力に補完する拡張機能PGroongaの最新情報を紹介します。PGroongaは単なる全文検索モジュールではありません。PostgreSQLで簡単にリッチな全文検索システムを実現するための機能一式を提供します。PGroongaは最新PostgreSQLへの対応も積極的です。例としてロジカルレプリケーションと組み合わせた活用方法も紹介します。
Deep dive into internal of Fluentd1.2: trace it in the level of source code. Giving a brief introduction to Fluent Bit and compare its performance with Fluentd.
RubyKaigi 2022 - Fast data processing with Ruby and Apache ArrowKouhei Sutou
I introduced Ruby and Apache Arrow integration including the "super fast large data interchange and processing" Apache Arrow feature at RubyKaigi Takeout 2021.
This talk introduces how we can use the "super fast large data interchange and processing" Apache Arrow feature in Ruby. Here are some use cases:
* Fast data retrieval (fast pluck) from DB such as MySQL and PostgreSQL for batch processes in a Ruby on Rails application
* Fast data interchange with JavaScript for dynamic visualization in a Ruby on Rails application
* Fast OLAP with in-process DB such as DuckDB and Apache Arrow DataFusion in a Ruby on Rails application or irb session
RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache ArrowKouhei Sutou
To use Ruby for data processing widely, Apache Arrow support is important. We can do the followings with Apache Arrow:
* Super fast large data interchange and processing
* Reading/writing data in several famous formats such as CSV and Apache Parquet
* Reading/writing partitioned large data on cloud storage such as Amazon S3
This talk describes the followings:
* What is Apache Arrow
* How to use Apache Arrow with Ruby
* How to integrate with Ruby 3.0 features such as MemoryView and Ractor
Apache Arrow 1.0 - A cross-language development platform for in-memory dataKouhei Sutou
Apache Arrow is a cross-language development platform for in-memory data. You can use Apache Arrow to process large data effectively in Python and other languages such as R. Apache Arrow is the future of data processing. Apache Arrow 1.0, the first major version, was released at 2020-07-24. It's a good time to know Apache Arrow and start using it.
Apache Arrow - A cross-language development platform for in-memory dataKouhei Sutou
Apache Arrow is the future for data processing systems. This talk describes how to solve data sharing overhead in data processing system such as Spark and PySpark. This talk also describes how to accelerate computation against your large data by Apache Arrow.
csv, one of the standard libraries, in Ruby 2.6 has many improvements:
* Default gemified
* Faster CSV parsing
* Faster CSV writing
* Clean new CSV parser implementation for further improvements
* Reconstructed test suites for further improvements
* Benchmark suites for further performance improvements
These improvements are done without breaking backward compatibility.
This talk describes details of these improvements by a new csv maintainer.
PGroonga 2 – Make PostgreSQL rich full text search system backend!Kouhei Sutou
PGroonga 2.0 has been released with 2 years development since PGroonga 1.0.0. PGroonga 1.0.0 just provides fast full text search with all languages support. It's important because it's a lacked feature in PostgreSQL. PGroonga 2.0 provides more useful features to implement rich full text search system with PostgreSQL. This session shows how to implement rich full text search system with PostgreSQL!
This talk describes about PGroonga that resolves these problems.
PGroonga is fast and flexible full text search extension for PostgreSQL. Zulip is a chat tool that uses PostgreSQL and PGroonga. This talk describes why PGroonga is suitable for Zulip.
30. Groonga族2015 Powered by Rabbit 2.1.9
Mroongaを使う:検索
SELECT * FROM texts
WHERE
MATCH(content)
AGAINST('*D+ キーワード1 キーワード2'
IN BOOLEAN MODE);
ポイント:Web検索エンジンと同様にクエリーを書ける
36. Groonga族2015 Powered by Rabbit 2.1.9
例:MySQL 5.7と比較ポイント:MySQL 5.7のInnoDBは日本語全文検索対応!
クエリー 文字数 InnoDB
mecab
InnoDB
bigram
Mroonga
bigram
COUNT 1 2.62s N/A 0.74s
COUNT 2 0.07s 1.74s 0.27s
ORDER
BY
1 2.78s N/A 0.83s
ORDER
BY
2 0.09s 2.72s 0.31s
出典:MySQLの全文検索に関するあれやこれや by yoku0825
http://www.slideshare.net/yoku0825/mysql-47364986
InnoDB mecabは遅いことがある
Mroongaは安定して速い
37. Groonga族2015 Powered by Rabbit 2.1.9
MySQL 5.7で日本語全文検索
CREATE TABLE texts (
content TEXT,
FULLTEXT INDEX (content)
WITH PARSER mecab -- ←がポイント
) ENGINE=InnoDB
DEFAULT CHARSET=utf8mb4;
mecabの代わりにngramも指定できるが実用的ではなさそう
mecabを使い場合は↑の他にMeCabの設定が必要
https://dev.mysql.com/doc/refman/5.7/en/fulltext-search-
mecab.html
38. Groonga族2015 Powered by Rabbit 2.1.9
Mroonga vs MySQL 5.7
Mroonga
デフォルトでいい感じに動く
(選びたければ指定できる)
安定して速い
MySQL 5.7 InnoDB
パーサーを選ばないといけない
(mecab?ngram?違いは?)
遅いこともある
41. Groonga族2015 Powered by Rabbit 2.1.9
ORDER BY LIMIT最適化
SELECT * FROM texts
WHERE MATCH(content1)
AGAINST('...' IN BOOLEAN MODE)
ORDER BY priority
LIMIT 10;
MySQLに先頭10件しか返さない
ORDER BY LIMITをMySQLが処理するより
Groongaが処理する方が速いので速くなる
42. Groonga族2015 Powered by Rabbit 2.1.9
ORDER BY LIMIT最適化!
SELECT * FROM texts
WHERE MATCH(content)
AGAINST('...' IN BOOLEAN MODE)
AND n_likes < 100 -- 追加
ORDER BY priority
LIMIT 10;
Groonga:MATCH AGAINST以外も処理
MySQLが処理するよりGroongaが処理する方が速いので速くなる
64. Groonga族2015 Powered by Rabbit 2.1.9
Rubyでテーブル定義
Groonga::Schema.define do |schema|
schema.create_table("Users",
:type => :hash) do |table|
table.short_text("name")
table.int8("age")
end
end
67. Groonga族2015 Powered by Rabbit 2.1.9
Rubyで検索
twentys = users.select do |record|
(record.age >= 20) & (record.age < 30)
end
twentys.each do |twenty|
p twenty.name # => "Alice"
end
171. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 1
text @@ pgroonga.queryを導入
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT * FROM WHERE content @@ '全文検索';
-- ↑PostgreSQL組み込みの@@が使われる
SELECT * FROM
WHERE content @@ '全文検索'::pgroonga.query;
-- ↑PGroonga提供の@@が使われる
173. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 3
同義語展開対応
body @@ pgroonga.expand_query('ネジ')
-- ↓
body @@ 'ネジ OR ねじ OR ボルト'
174. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 4
ステミング対応
found/finds→find
CREATE INDEX index ON entries
USING pgroonga (title)
WITH (token_filters = 'TokenStem');
175. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 5
重み対応
-- タイトルのほうが本文より10倍重要
body @@ ('title * 10 || body', 'ポスグレ')
176. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 6
複合主キー対応
CREATE TABLE t (
c1 INT,
c2 INT,
PRIMARY KEY (c1, c2)
);
CREATE INDEX index ON t
USING pgroonga (c1, c2);
177. Groonga族2015 Powered by Rabbit 2.1.9
PGroonga: 7
pg_dumpのWITH問題を解消
CREATE INDEX index ON t
USING pgroonga (c)
WITH (tokenizer = 'TokenMecab');
-- ↓pg_dump: クォートがとれる→小文字に正規化される
CREATE INDEX index ON t
USING pgroonga (c)
WITH (tokenizer = TokenMecab);