What is PipelineDB?
2016/07/22
Stream勉強会
@tamtam180
Agenda
• Self-Introduction
• What’s PipelineDB
• What’s Continuous Query
• Continuous Transform/Trigger
• DEMO
• Enterprise Edition
• Tips
Self-Introduction
• Twitter: @tamtam180
• Works
– SquareEnix
• PlayOnline, FF-XIV
–Server Programmer
– SmartNews
• Advertising
–Software Engineer
What’s PipelineDB
What’s PipelineDB
• OSS Database (+Enterprise Edition)
– GPLv3
• Support Continuous Query
• on PostgreSQL as extension
– 0.8.x on 9.4, 0.9.x on 9.5
– No special client libraries
• Support probabilistic data structure & algorithm
– Bloom-filter, hyperloglog, Count-Min sketch,
– FSS Top-K, T-Digest
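These structures are exposed as aggregate functions in continuous views. A minimal sketch, assuming the aggregate names from the PipelineDB docs (hll_agg, bloom_agg) and its minute() time-bucketing helper:

```sql
CREATE STREAM events (user_id BIGINT, url TEXT);

-- Approximate distinct users and URL membership, bucketed per minute.
CREATE CONTINUOUS VIEW uniques AS
SELECT
  minute(arrival_timestamp) AS m,
  hll_agg(user_id) AS users,     -- HyperLogLog state
  bloom_agg(url)   AS seen_urls  -- Bloom filter state
FROM events
GROUP BY m;
```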
What’s Continuous Query
What’s Continuous Query
• RDB
Timestamp            ChannelID  CampaignID  UserID  Sales
2016/06/18 13:41:10  1000       10          100     30
2016/06/18 13:43:15  1000       10          101     20
2016/06/18 13:47:20  1001       11          123     15
2016/06/18 14:10:10  1000       12          100     30
2016/06/19 14:15:30  1002       14          101     20
2016/06/19 15:16:30  1003       11          100     15
2016/06/19 16:17:56  1001       14          123     30
Aggregate
What’s Continuous Query
• RDB
Timestamp ChannelID CampaignID UserID Sales
2016/06/18	13:41:10 1000 10 100 30
2016/06/18	13:43:15 1000 10 101 20
2016/06/18	13:47:20 1001 11 123 15
2016/06/18	14:10:10 1000 12 100 30
2016/06/19	14:15:30 1002 14 101 20
2016/06/19	15:16:30 1003 11 100 15
2016/06/19	16:17:56 1001 14 123 30
Aggregate
SELECT
TO_CHAR(timestamp, 'YYYY-MM-DD') as ymd,
campaignId, SUM(sales)
FROM clicks
WHERE
timestamp > NOW() - INTERVAL '3 days'
GROUP BY ymd, campaignId;
What’s Continuous Query
• PipelineDB
(diagram: data records flow into a Stream, which feeds one or more Continuous Views)
CREATE STREAM stream_name (
timestamp TIMESTAMP,
channelId BIGINT,
campaignId BIGINT,
userId BIGINT,
sales BIGINT
);
CREATE CONTINUOUS VIEW cv_name WITH(max_age='3 days') AS
SELECT
TO_CHAR(timestamp, 'YYYY-MM-DD') as ymd,
campaignId, SUM(sales)
FROM
stream_name
GROUP BY
ymd, campaignId;
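The CV is then read like an ordinary table (the unaliased SUM(sales) surfaces as a column named sum):

```sql
SELECT ymd, campaignId, sum
FROM cv_name
ORDER BY ymd, campaignId;
```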
What’s Continuous Query
• Continuous Query: STREAM
CREATE STREAM stream_name (
timestamp TIMESTAMP,
channelId BIGINT,
campaignId BIGINT,
userId BIGINT,
sales BIGINT
);
Internal
• Stream
–Foreign table
–Foreign server
•pipeline_streams
–Foreign Data Wrapper
• stream_fdw
• Stream Buffer
Internal
(diagram: INSERT INTO a stream writes HeapTuples into the Stream Buffer, a preallocated block of shared memory used as a concurrent circular buffer; continuous queries run on microbatches of those tuples and incrementally update the CV tables)
Internal
https://wiki.postgresql.org/images/a/ad/South_Bay_PG_Meetup_2016-03-08_PipelineDB.pdf
What’s Continuous Query
• Continuous View
CREATE CONTINUOUS VIEW cv_name AS
SELECT
TO_CHAR(timestamp, 'YYYY-MM-DD') as ymd,
campaignId, SUM(sales)
FROM
stream_name
WHERE
arrival_timestamp > clock_timestamp() - interval '3 days'
GROUP BY
ymd, campaignId;
Deprecated: this sliding-window form is superseded by WITH(max_age=...)
What’s Continuous Query
• Continuous View
CREATE CONTINUOUS VIEW cv_name WITH(max_age='3 days')
AS
SELECT
TO_CHAR(timestamp, 'YYYY-MM-DD') as ymd,
campaignId, SUM(sales)
FROM
stream_name
GROUP BY
ymd, campaignId;
What’s Continuous Query
What’s Continuous Query
• Data Insert
INSERT INTO stream_name (timestamp, campaignId, sales)
VALUES
('2016-07-22 11:00:01', 100, 25),
('2016-07-22 11:00:02', 101, 20),
('2016-07-22 11:00:03', 101, 22)
;
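The CV defined earlier is updated as rows arrive, so a plain SELECT reflects them immediately (column names as in cv_name above):

```sql
SELECT ymd, campaignId, sum
FROM cv_name
WHERE ymd = '2016-07-22';
-- campaignId 100 → 25, campaignId 101 → 42 (20 + 22)
```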
What’s Continuous Query
• Also use COPY statement
COPY stream_name (timestamp, campaignId, sales)
FROM '/some/path/file.csv';
COPY stream_name (timestamp, campaignId, sales)
FROM STDIN;
What’s Continuous Query
On SmartNews Ads
On SmartNews Ads
(diagram: APP → FILE → Fluentd → Kinesis → Consumer → PipelineDB, JSON end to end; Chartio and an internal Console query PipelineDB)
On SmartNews Ads
• 1 Column JSONB
CREATE STREAM imp_stream ( item JSONB );
CREATE STREAM vimp_stream ( item JSONB );
CREATE STREAM click_stream ( item JSONB );
On SmartNews Ads
• Create Counter table per stream
CREATE CONTINUOUS VIEW imp_count
WITH(max_age='7 day', step_factor=1)
AS
SELECT
(TO_CHAR(TO_TIMESTAMP((item::jsonb->>'timestamp')::bigint),
'YYYY-MM-DD HH24:00:00'))::timestamp as dt,
COUNT(*) as cnt
FROM imp_stream
GROUP BY dt;
If a stream has no CV attached to it, every INSERT spews warnings.
On SmartNews Ads
• Once this setup is in place:
(diagram: the Consumer sends JSON into the Streams, each Stream with one CV attached)
On SmartNews Ads
• No need to touch the Consumer at all
• Just keep defining new CVs
(diagram: same pipeline as before, with CV-2 and CV-3 added on the existing Streams)
probabilistic data structure
HLL
HLL
• HyperLogLog
– Counts distinct values
– Gives an estimate, not an exact count
– Used to compute unique users
– HLLs can be merged
https://stefanheule.com/papers/edbt13-hyperloglog.pdf
HLL
• Distinct count => computed with HLL
CREATE CONTINUOUS VIEW imp_count
WITH(max_age='7 day', step_factor=1)
AS
SELECT
(TO_CHAR(TO_TIMESTAMP((item::jsonb->>'timestamp')::bigint),
'YYYY-MM-DD HH24:00:00'))::timestamp as dt,
COUNT(distinct (item->>'uuid')::text) as uuid_ucnt
FROM imp_stream
GROUP BY dt;
Using exact_count_distinct, you can also get the exact value.
HLL
• Merging HyperLogLogs
• Common requirements:
– Unique users per hour
– Unique users per day
– Unique users over an arbitrary period
– Offices in multiple regions, so report per timezone (e.g. JST and PST)
HLL
• Keep hourly buckets as raw HLL state
CREATE CONTINUOUS VIEW test_cv WITH(max_age='30 days')
AS
SELECT
to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') as ymd_jst,
date_part('hour', to_timestamp((item->>'timestamp')::bigint + 3600*9))::integer as h_jst,
hll_agg((item->>'uuid')::text) as uuid_agg
FROM test_stream
GROUP BY
ymd_jst, h_jst;
HLL
• Merge at SELECT time
-- HOURLY
SELECT
ymd_jst, h_jst,
hll_cardinality(combine(uuid_agg)) as ucnt
FROM test_cv
GROUP BY ymd_jst, h_jst
ORDER BY ymd_jst DESC, h_jst DESC
LIMIT 24 * 3;
-- DAILY
SELECT
ymd_jst,
hll_cardinality(combine(uuid_agg)) as ucnt
FROM test_cv
GROUP BY ymd_jst
ORDER BY ymd_jst DESC
LIMIT 3;
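For the remaining case, an arbitrary period, drop the time grouping and combine over the filtered range; a sketch:

```sql
-- Unique users over an arbitrary JST date range.
SELECT hll_cardinality(combine(uuid_agg)) AS ucnt
FROM test_cv
WHERE ymd_jst BETWEEN '2016-07-01' AND '2016-07-14';
```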
HLL
Continuous Transform
Continuous Transform
• Processes data arriving on a stream and copies it to another stream
– Filtering
– Reshaping data
– Plain duplication with no transformation
Continuous Transform
• Processes data arriving on a stream and copies it to another stream
(diagram: Stream → Continuous Transform → Stream → CVs)
Continuous Transform
• Define a TRANSFORM
CREATE CONTINUOUS TRANSFORM xxx_etl AS
SELECT item::jsonb
FROM xxx_stream
WHERE
to_timestamp((item->'obj'->>'timestamp_sec')::bigint) > clock_timestamp() - interval '7 days'
AND (item->'obj'->>'flag')::bigint = 1
THEN EXECUTE PROCEDURE
pipeline_stream_insert('xxx_stream_etl')
pipeline_stream_insert is a built-in output function;
you can also define your own.
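A user-defined output function is an ordinary trigger function; a sketch (xxx_archive and insert_into_archive are made-up names):

```sql
CREATE TABLE xxx_archive (item JSONB);

CREATE OR REPLACE FUNCTION insert_into_archive()
RETURNS trigger AS $$
BEGIN
  -- NEW holds the row produced by the transform's SELECT.
  INSERT INTO xxx_archive (item) VALUES (NEW.item);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONTINUOUS TRANSFORM xxx_archiver AS
  SELECT item::jsonb FROM xxx_stream
  THEN EXECUTE PROCEDURE insert_into_archive();
```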
Continuous Transform
• Defining two CTs lets you copy a stream
(diagram: Stream → two Continuous Transforms → two Streams)
Continuous Trigger
Continuous Trigger
• Trigger
– A trigger in the usual sense
– e.g. alert when a campaign's spend exceeds its daily budget
– e.g. notify when impression smoothing is falling behind
• For notification you can:
– Insert a record into another table
– Call a WebHook over HTTP
– Send email (using Mailgun or similar)
No time to build a sample!!
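For reference, the budget alert above would look roughly like this (a sketch; spend_cv, the threshold, and notify_budget_exceeded() are all made up):

```sql
CREATE CONTINUOUS VIEW spend_cv AS
SELECT campaignId, SUM(sales) AS total
FROM stream_name
GROUP BY campaignId;

-- Fire whenever an update pushes a campaign past its daily budget.
CREATE TRIGGER budget_alert
  AFTER UPDATE ON spend_cv
  FOR EACH ROW
  WHEN (NEW.total > 100000)
  EXECUTE PROCEDURE notify_budget_exceeded();
```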
padhoc
padhoc
• Creating a CV every time you just want a quick look is a pain
• The padhoc command gives you an instant, throwaway view
• Requires `continuous_queries_adhoc_enabled = true` in the config
– The default is false, for some reason
• padhoc -c "SELECT count(*) FROM stream"
• padhoc -c "select (item->>'ip')::text as ip, count(*) from imp_stream group by ip" -d pipeline
Enterprise Edition
Enterprise Edition
• It's under NDA. Sorry.
• Shipped as an extension
• What it adds:
–Clustering
•Read-optimized (default)
•Write-optimized
–HA
Tips
Tips: timestamp
• A STREAM ⇔ CT pipeline behaves as if a single transaction
stays open the whole time, so mind the difference between:
–clock_timestamp()
–arrival_timestamp
–now()
Tips: timestamp
• clock_timestamp()
– Current date and time (changes during statement execution)
– May appear only once per statement:
ERROR: clock_timestamp() may only appear
once in a WHERE clause
Tips
• There are more tips;
ask me directly if you're curious.
We are hiring!
• SmartNews is hiring engineers, apparently
–Ads engineers
–Front-end engineers
–iOS/Android engineers
–Productivity engineers
–Machine learning / NLP engineers
–and more
• http://about.smartnews.com/ja/careers/
Lunch
Thank you.
