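DMM.com currently collects more than 100 million behavior-log records per day, along with content information from each of its services and open data such as regional information, and uses them for data-driven marketing and marketing automation. As the data volume has grown and its uses have diversified, however, data-processing latency has become a problem. This talk presents a case in which replacing the existing Hive processing with Hive on Spark cut the daily batch processing time to one third, and explains concretely how to introduce Hive on Spark and what benefits it brings.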
Hadoop / Spark Conference Japan 2016
http://www.eventbrite.com/e/hadoop-spark-conference-japan-2016-tickets-20809016328
14. / 103
Hive on Spark – Setup Procedure
• Using CDH
  • Configuring Hive on Spark
    http://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_config.html
• Using the Apache community release
  • Hive on Spark: Getting Started
    https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
■ Important: Hive on Spark is included in CDH 5.4 and higher but is not currently supported nor recommended for production use.
16. / 103
Hive on Spark – Setup Procedure
• Set the engine parameter when running a query
-- Hive on Spark
SET hive.execution.engine=spark;
-- Hive on MapReduce
SET hive.execution.engine=mr;
-- Hive on Tez
SET hive.execution.engine=tez;
* For reference
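Because the engine is chosen per session with a single SET statement, the same query can be run under each engine for comparison. The following is a minimal Python sketch under that assumption; `with_engine` is a hypothetical helper (not part of Hive) that only builds the script text, and how the script is submitted to Hive is left out.

```python
# Hypothetical helper (not part of Hive): prepend the engine setting to a query
# so the same HiveQL can be run under each execution engine for comparison.
ENGINES = ("mr", "tez", "spark")  # the hive.execution.engine values listed above


def with_engine(query: str, engine: str) -> str:
    """Return a HiveQL script that selects `engine` and then runs `query`."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    body = query.strip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{body};\n"


if __name__ == "__main__":
    # Print one script per engine for an illustrative query.
    for name in ENGINES:
        print(with_engine("SELECT COUNT(*) FROM action_log", name))
```

Rejecting unknown engine names keeps a typo from producing a script that only fails once it reaches the cluster.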
70. / 103
[Supplement] Using the WITH Clause
• Nested queries are hard to read (without the WITH clause)
http://www.slideshare.net/MarkusWinand/modern-sql
-- ▽ Final step
SELECT a1, a2, b2, c1, c2
FROM (
  -- ▽ Second step
  SELECT a1, a2, b2
  FROM a
  JOIN (
    -- ▽ First step
    SELECT b1, b2 FROM b
  ) AS d ON (a.a1 = d.b1)
) AS e
JOIN (
  -- ▽ Third step
  SELECT c1, c2 FROM c
) AS f ON (e.a2 = f.c1);
75. / 103
[Supplement] Using the WITH Clause
• Rewriting the nested query with the WITH clause
http://www.slideshare.net/MarkusWinand/modern-sql
WITH
  -- ▽ First step
  d AS ( SELECT b1, b2 FROM b ),
  -- ▽ Second step
  e AS (
    SELECT a1, a2, b2
    FROM a JOIN d ON (a.a1 = d.b1)
  ),
  -- ▽ Third step
  f AS ( SELECT c1, c2 FROM c )
-- ▽ Final step
SELECT a1, a2, b2, c1, c2
FROM e
JOIN f ON (e.a2 = f.c1);
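One way to check that the WITH form is only a reordering of the nested form is to run both on the same data. The sketch below does that with Python's built-in sqlite3 module standing in for Hive; the contents of tables a, b and c are made-up sample rows, and `run_both` is a hypothetical helper name.

```python
import sqlite3

# Nested form: the steps read inside-out.
NESTED_SQL = """
SELECT a1, a2, b2, c1, c2
FROM (
  SELECT a1, a2, b2
  FROM a
  JOIN (SELECT b1, b2 FROM b) AS d ON (a.a1 = d.b1)
) AS e
JOIN (SELECT c1, c2 FROM c) AS f ON (e.a2 = f.c1)
"""

# WITH form: the same steps, listed top to bottom in execution order.
WITH_SQL = """
WITH
  d AS (SELECT b1, b2 FROM b),
  e AS (SELECT a1, a2, b2 FROM a JOIN d ON (a.a1 = d.b1)),
  f AS (SELECT c1, c2 FROM c)
SELECT a1, a2, b2, c1, c2
FROM e
JOIN f ON (e.a2 = f.c1)
"""


def run_both():
    """Run both forms on made-up sample tables and return their sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (a1 INTEGER, a2 INTEGER);
        CREATE TABLE b (b1 INTEGER, b2 TEXT);
        CREATE TABLE c (c1 INTEGER, c2 TEXT);
        INSERT INTO a VALUES (1, 10), (2, 20), (3, 30);
        INSERT INTO b VALUES (1, 'x'), (2, 'y');
        INSERT INTO c VALUES (10, 'p'), (30, 'q');
    """)
    nested = sorted(conn.execute(NESTED_SQL).fetchall())
    with_cte = sorted(conn.execute(WITH_SQL).fetchall())
    conn.close()
    return nested, with_cte


if __name__ == "__main__":
    nested, with_cte = run_both()
    print(nested == with_cte, nested)
```

Only the row with a1 = 1 survives both joins in this sample, so each form returns the same single row.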
81. / 103
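[Supplement] Using CASE Expressions
• CASE expressions
  • Defined in SQL92
  • An expression that returns a value according to a condition
• Disk I/O tends to be the bottleneck in Spark
  • Reduce the number of table scans as far as possible
• Rewrite UNION ALL and LEFT OUTER JOIN with CASE expressions
  • Row-oriented data can be converted to column-oriented data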
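The UNION ALL case from the list above can be sketched as follows. This is a minimal illustration, not the presenter's code: it uses Python's built-in sqlite3 module in place of Hive, a small illustrative `action_log` table, and a hypothetical helper `count_actions`. The UNION ALL form names the table once per branch and returns one row per action, while the CASE form names it once and returns the counts as columns of a single row.

```python
import sqlite3

# Small illustrative sample of (dt, action) rows.
ROWS = [
    ("2016-02-01", "view"), ("2016-02-01", "view"), ("2016-02-01", "view"),
    ("2016-02-01", "click"), ("2016-02-01", "click"),
    ("2016-02-01", "purchase"),
    ("2016-02-02", "view"), ("2016-02-02", "view"),
]

# Row-oriented counts: one SELECT over action_log per action.
UNION_ALL_SQL = """
SELECT 'view' AS action, COUNT(*) AS ct FROM action_log WHERE action = 'view'
UNION ALL
SELECT 'click', COUNT(*) FROM action_log WHERE action = 'click'
UNION ALL
SELECT 'purchase', COUNT(*) FROM action_log WHERE action = 'purchase'
"""

# Column-oriented counts: a single SELECT over action_log.
CASE_SQL = """
SELECT
    SUM(CASE action WHEN 'view' THEN 1 END) AS view_ct
  , SUM(CASE action WHEN 'click' THEN 1 END) AS click_ct
  , SUM(CASE action WHEN 'purchase' THEN 1 END) AS purchase_ct
FROM action_log
"""


def count_actions():
    """Return the counts as a dict (UNION ALL form) and as one row (CASE form)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE action_log (dt TEXT, action TEXT)")
    conn.executemany("INSERT INTO action_log VALUES (?, ?)", ROWS)
    by_rows = dict(conn.execute(UNION_ALL_SQL).fetchall())
    by_columns = conn.execute(CASE_SQL).fetchone()
    conn.close()
    return by_rows, by_columns


if __name__ == "__main__":
    print(count_actions())
```

Note that a CASE expression without ELSE yields NULL, so SUM returns NULL rather than 0 for an action that never occurs.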
82. / 103
[Supplement] Using CASE Expressions
• Example: aggregating CTR and CVR (without CASE expressions)
WITH
action_log AS (
  SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'click' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'click' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'purchase' AS action
  UNION ALL SELECT '2016-02-02' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-02' AS dt, 'view' AS action
),
t1 AS (
  SELECT dt, action, COUNT(*) AS ct FROM action_log GROUP BY dt, action
)
SELECT
  v.dt, COALESCE(c.ct / v.ct, 0.0) AS ctr, COALESCE(p.ct / c.ct, 0.0) AS cvr
FROM t1 AS v
LEFT OUTER JOIN t1 AS c ON v.dt = c.dt AND c.action = 'click'
LEFT OUTER JOIN t1 AS p ON v.dt = p.dt AND p.action = 'purchase'
WHERE v.action = 'view';
action_log
dt          action
2016-02-01  view
2016-02-01  view
2016-02-01  view
2016-02-01  click
2016-02-01  click
2016-02-01  purchase
2016-02-02  view
2016-02-02  view
t1
dt          action    ct
2016-02-01  view      3
2016-02-01  click     2
2016-02-01  purchase  1
2016-02-02  view      2
Result
dt          ctr    cvr
2016-02-01  0.666  0.5
2016-02-02  0      0
85. / 103
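[Supplement] Using CASE Expressions
• Example: aggregating CTR and CVR (with CASE expressions)
WITH
-- Same sample data as in the example without CASE expressions
action_log AS (
  SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'click' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'click' AS action
  UNION ALL SELECT '2016-02-01' AS dt, 'purchase' AS action
  UNION ALL SELECT '2016-02-02' AS dt, 'view' AS action
  UNION ALL SELECT '2016-02-02' AS dt, 'view' AS action
),
t1 AS (
  SELECT
    dt
    , SUM(CASE action WHEN 'view' THEN 1 END) AS view_ct
    , SUM(CASE action WHEN 'click' THEN 1 END) AS click_ct
    , SUM(CASE action WHEN 'purchase' THEN 1 END) AS purchase_ct
  FROM action_log
  GROUP BY dt
)
SELECT
  dt
  , COALESCE(click_ct / view_ct, 0.0) AS ctr
  , COALESCE(purchase_ct / click_ct, 0.0) AS cvr
FROM t1;
Value of each CASE expression per row (before SUM)
dt          CASE view  CASE click  CASE purchase
2016-02-01  1          NULL        NULL
2016-02-01  1          NULL        NULL
2016-02-01  1          NULL        NULL
2016-02-01  NULL       1           NULL
2016-02-01  NULL       1           NULL
2016-02-01  NULL       NULL        1
2016-02-02  1          NULL        NULL
2016-02-02  1          NULL        NULL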
t1
dt          view_ct  click_ct  purchase_ct
2016-02-01  3        2         1
2016-02-02  2        NULL      NULL
Result
dt          ctr    cvr
2016-02-01  0.666  0.5
2016-02-02  0      0
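The join-based and CASE-based aggregations can be compared directly by running both on the same rows. The following is a minimal sketch using Python's built-in sqlite3 module in place of Hive, with a hypothetical helper `ctr_cvr`. One deliberate difference from the queries above: SQLite divides integers as integers, so the sketch multiplies by 1.0, whereas Hive's `/` operator already returns a double.

```python
import sqlite3

# The same eight sample rows as the action_log table above.
ROWS = [
    ("2016-02-01", "view"), ("2016-02-01", "view"), ("2016-02-01", "view"),
    ("2016-02-01", "click"), ("2016-02-01", "click"),
    ("2016-02-01", "purchase"),
    ("2016-02-02", "view"), ("2016-02-02", "view"),
]

# Without CASE: t1 is referenced three times through self-joins.
JOIN_SQL = """
WITH t1 AS (
  SELECT dt, action, COUNT(*) AS ct FROM action_log GROUP BY dt, action
)
SELECT v.dt
  , COALESCE(1.0 * c.ct / v.ct, 0.0) AS ctr
  , COALESCE(1.0 * p.ct / c.ct, 0.0) AS cvr
FROM t1 AS v
LEFT OUTER JOIN t1 AS c ON v.dt = c.dt AND c.action = 'click'
LEFT OUTER JOIN t1 AS p ON v.dt = p.dt AND p.action = 'purchase'
WHERE v.action = 'view'
ORDER BY v.dt
"""

# With CASE: one aggregation pivots the three counts into columns.
CASE_SQL = """
WITH t1 AS (
  SELECT dt
    , SUM(CASE action WHEN 'view' THEN 1 END) AS view_ct
    , SUM(CASE action WHEN 'click' THEN 1 END) AS click_ct
    , SUM(CASE action WHEN 'purchase' THEN 1 END) AS purchase_ct
  FROM action_log
  GROUP BY dt
)
SELECT dt
  , COALESCE(1.0 * click_ct / view_ct, 0.0) AS ctr
  , COALESCE(1.0 * purchase_ct / click_ct, 0.0) AS cvr
FROM t1
ORDER BY dt
"""


def ctr_cvr():
    """Return (join_rows, case_rows), each a list of (dt, ctr, cvr) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE action_log (dt TEXT, action TEXT)")
    conn.executemany("INSERT INTO action_log VALUES (?, ?)", ROWS)
    join_rows = conn.execute(JOIN_SQL).fetchall()
    case_rows = conn.execute(CASE_SQL).fetchall()
    conn.close()
    return join_rows, case_rows


if __name__ == "__main__":
    join_rows, case_rows = ctr_cvr()
    print(join_rows == case_rows, case_rows)
```

The two forms agree on this data. They can differ for a date that has no 'view' rows: the join form drops such a date because of its WHERE clause, while the CASE form still returns it.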
95. / 103
[Supplement] Using Window Functions
• Example: analyzing time-series data
WITH
access_log AS (
  SELECT '2016-02-01' AS dt, 'AAA' AS username
  UNION ALL SELECT '2016-02-02' AS dt, 'AAA' AS username
  UNION ALL SELECT '2016-02-03' AS dt, 'AAA' AS username
  UNION ALL SELECT '2016-02-07' AS dt, 'AAA' AS username
  UNION ALL SELECT '2016-02-01' AS dt, 'BBB' AS username
  UNION ALL SELECT '2016-02-03' AS dt, 'BBB' AS username
  UNION ALL SELECT '2016-02-05' AS dt, 'BBB' AS username
)
SELECT
  dt, username
  , LAG(dt) OVER(PARTITION BY username ORDER BY dt) AS last_access
  , DATEDIFF(dt, LAG(dt) OVER(PARTITION BY username ORDER BY dt)) AS access_span
  , COUNT(1) OVER(PARTITION BY username ORDER BY dt) AS cumulative_access
FROM access_log;
access_log
dt          username
2016-02-01  AAA
2016-02-02  AAA
2016-02-03  AAA
2016-02-07  AAA
2016-02-01  BBB
2016-02-03  BBB
2016-02-05  BBB
Result
dt          username  last_access  access_span  cumulative_access
2016-02-01  AAA       NULL         NULL         1
2016-02-02  AAA       2016-02-01   1            2
2016-02-03  AAA       2016-02-02   1            3
2016-02-07  AAA       2016-02-03   4            4
2016-02-01  BBB       NULL         NULL         1
2016-02-03  BBB       2016-02-01   2            2
2016-02-05  BBB       2016-02-03   2            3
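The window-function query can be tried without a Hive cluster. The sketch below is an approximation that runs it with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). SQLite has no DATEDIFF, so the day difference is computed with `julianday` instead; `access_spans` is a hypothetical helper name.

```python
import sqlite3

# The same seven (dt, username) rows as the access_log table above.
ROWS = [
    ("2016-02-01", "AAA"), ("2016-02-02", "AAA"),
    ("2016-02-03", "AAA"), ("2016-02-07", "AAA"),
    ("2016-02-01", "BBB"), ("2016-02-03", "BBB"), ("2016-02-05", "BBB"),
]

# LAG fetches the previous access of the same user; COUNT with ORDER BY in the
# window gives a running count. julianday() stands in for Hive's DATEDIFF.
WINDOW_SQL = """
SELECT
  dt, username
  , LAG(dt) OVER (PARTITION BY username ORDER BY dt) AS last_access
  , CAST(julianday(dt)
         - julianday(LAG(dt) OVER (PARTITION BY username ORDER BY dt))
         AS INTEGER) AS access_span
  , COUNT(1) OVER (PARTITION BY username ORDER BY dt) AS cumulative_access
FROM access_log
ORDER BY username, dt
"""


def access_spans():
    """Return (dt, username, last_access, access_span, cumulative_access) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (dt TEXT, username TEXT)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", ROWS)
    rows = conn.execute(WINDOW_SQL).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in access_spans():
        print(row)
```

Each user's first row has no previous access, so `last_access` and `access_span` come back as NULL (None in Python), matching the result table above.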