Genki Hadoop!
Daisuke Hirama
Insight Technology, Inc.
Copyright © 2013 Insight Technology, Inc. All Rights Reserved.
[C12] Genki Hadoop! Analyzing Oracle with Hadoop, by Daisuke Hirama


  1. Genki Hadoop! Daisuke Hirama, Insight Technology, Inc. Copyright © 2013 Insight Technology, Inc. All Rights Reserved.
  2. "Big Data"! "Big Data"! "Big data"!
  3. Say "big data" and everyone thinks Hadoop... why? [figure: data volumes growing to the PB scale]
  4. The core of Hadoop: HDFS and MapReduce
  5. DB engineers want to use Hadoop too! Super Hadoop 2013!
     • 4-node Hadoop cluster
     • Cloudera CDH4 4.4.0
     • Cloudera Manager Standard 4.7.2
     • 1 master node (NameNode, JobTracker)
     • 4 slave nodes (DataNode, TaskTracker), one co-located with the master
     • 12 SSDs ("Akihabara model")
     • InfiniBand for the cluster interconnect
  6. Let's analyze the DB server's logs! Super RAC 2013!
     • Oracle Database 12c
     • 4-node RAC configuration
     • 3 storage nodes (home-built PCs)
     • 18 SSDs ("Akihabara model")
     • InfiniBand for the node interconnect
     Workload run on the Super RAC:
     • Nightly batch: TPC-H starting at 1:00 a.m. (about 10 minutes)
     • Daytime OLTP: TPC-C starting at 8:00 a.m. (1 hour)
  7. Part 1: Let's analyze performance logs. Capture the logs with dstat:
     "Dstat 0.7.0 CSV output"
     "Author:","Dag Wieers <dag@wieers.com>",,,,"URL:","http://dag.wieers.com/home-made/dstat/"
     "Host:","iq-4node-db3",,,,"User:","root"
     "Cmdline:","dstat -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 --output /data01/logs/iq-4node-db3/dstat_cpu_
     "cpu0 usage",,,,,,"cpu1 usage",,,,,,"cpu2 usage",,,,,,"cpu3 usage",,,,,,"cpu4 usage",,,,,,"cpu5 usage",,,,,,"cpu6 usage",,,,,,"c
     "usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","wai","hiq","siq","usr","sys","idl","w
     1.086,0.600,98.202,0.106,0.0,0.005,1.825,0.491,97.593,0.086,0.0,0.005,0.677,0.225,99.070,0.017,0.0,0.011,0.417,0.140,99.427,0.01
     0.0,0.990,99.010,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,
     Step 1: delete the header block and the first row.
     Step 2: prepend the server name and date, plus a per-second row number:
     tail -86400 $fn | cat -n | sed 's/\s\+/,/g' | sed "s/^/${SVRNAME},${YESTERDAY}/"
     After processing:
     iq-4node-db3,20131030,1,0.0,0.990,99.010,0.0,0.0,0.0,0.0,1.0,99.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.
     iq-4node-db3,20131030,2,1.0,0.0,99.0,0.0,0.0,0.0,2.020,0.0,97.980,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.
     Step 3: load it into Hadoop:
     hadoop fs -put dstat_cpu_iq-4node-db3_20131030.csv dstat_cpu
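The shell pipeline above can be sketched in Python; `tag_rows` is a hypothetical helper (not from the deck) mirroring what the `cat -n` and `sed` steps do to each dstat data row.

```python
# Sketch of the preprocessing pipeline: number each dstat data row (one row
# per second) and prepend the server name and date, as `cat -n` + `sed` do.
def tag_rows(lines, servername, ymd):
    """Prefix each CSV row with servername, date, and a 1-based row number."""
    return [f"{servername},{ymd},{i},{line}"
            for i, line in enumerate(lines, start=1)]

rows = ["0.0,0.990,99.010", "1.0,0.0,99.0"]
for row in tag_rows(rows, "iq-4node-db3", "20131030"):
    print(row)
# → iq-4node-db3,20131030,1,0.0,0.990,99.010
# → iq-4node-db3,20131030,2,1.0,0.0,99.0
```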
  8. If it's CSV, Hive is the way to go! [diagram: Hive, backed by a metastore database (PostgreSQL, etc.), querying CSV files on HDFS]
  9. Define the CSV as a table:
     create external table dstat_cpu (
       servername    string,
       create_ymd    string,
       create_second int,
       cpu0_user     DOUBLE,
       cpu0_sys      DOUBLE,
       cpu0_idle     DOUBLE,
       page_in       DOUBLE,
       page_out      DOUBLE,
       system_int    DOUBLE,
       system_csw    DOUBLE
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     STORED AS TEXTFILE
     LOCATION '/user/root/dstat_cpu';
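The "external" in that DDL means Hive applies the schema only at read time; the CSV on HDFS is left untouched. A rough stdlib sketch of that schema-on-read idea, using the abbreviated column subset from the slide (`read_row` is an illustrative helper, not a Hive API):

```python
# Rough illustration of Hive "schema on read": the file stays as plain text,
# and the declared column types are applied only when a row is read.
SCHEMA = [("servername", str), ("create_ymd", str), ("create_second", int),
          ("cpu0_user", float), ("cpu0_sys", float), ("cpu0_idle", float)]

def read_row(line):
    """Split on the declared delimiter and cast fields to their declared types."""
    return {name: cast(value)
            for (name, cast), value in zip(SCHEMA, line.split(","))}

row = read_row("iq-4node-db3,20131030,1,0.0,0.990,99.010")
print(row["create_second"], row["cpu0_idle"])  # → 1 99.01
```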
  10. Now let's run it!
  11. Don't expect speed from Hive.
      • Overhead of translating the query into MapReduce
      • Overhead of MapReduce itself
      • Its selling point is developer productivity, not execution speed
      Still, it is strong at batch processing over large data. An example where 10x the data stayed within 2x the elapsed time
      (the 22 TPC-H queries, some modified for Hive):
      [bar chart: elapsed time in seconds, y-axis to 8000, SF=10 (GB) vs SF=100 (GB)]
  12. "I hear there's a fast one out now called Impala." (screenshot from Cloudera's website)
  13. How fast is Impala?
  14. With Impala it's blazing fast! But... TPC-H Q3, SF=10 (GB): [bar chart, seconds, y-axis to 120: Impala finishes in less than 1/5 of Hive's time]
  15. With lots of data it gets a bit tough... TPC-H Q3, SF=100 (GB): [bar chart, seconds, y-axis to 450: Hive vs Impala]
  16. Depending on the data's volume and nature, leaving it to the specialists is also an option. TPC-H Q3, SF=100 (GB): [bar chart, seconds, y-axis to 450: Hive, Impala, and a certain "world's fastest" RDB at 2.7 seconds]
  17. Part 2: Can we spot suspicious SQL?
      select * from CUSTOMER where C_LAST = 'Hirama';
      Is this a statement that normal business operations would issue?
      Sources: Oracle's audit trail, or a DB auditing tool.
      -- Run in the CDB
      alter system set audit_trail=xml, extended sid='*' scope=spfile;
      -- Run in the PDB
      AUDIT SELECT TABLE BY ACCESS;
      AUDIT INSERT TABLE BY ACCESS;
      AUDIT UPDATE TABLE BY ACCESS;
      AUDIT DELETE TABLE BY ACCESS;
      Log volume: 64 GB per day...
  18. Extract the SQL from the audit logs.
      <AuditRecord><Audit_Type>1</Audit_Type><Session_Id>140037</Session_Id>
      <DBID>409456161</DBID>
      <Sql_Text>select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
      order by l_returnflag, l_linestatus</Sql_Text>
      </AuditRecord>
      Pull out only the SQL and join it onto one line:
      select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate &lt;= date &apos;1998-12-01&apos; - interval &apos;91&apos; day (3) group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus
      • Extract the XML tag contents with Hadoop Streaming:
      hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \
        -D mapred.reduce.tasks=0 \
        -inputreader "StreamXmlRecordReader,begin=<Sql_Text>,end=</Sql_Text>" \
        -input XmlSql \
        -output Sql \
        -mapper cutlftag.sh \
        -file cutlftag.sh
      • With Hadoop Streaming, a shell script can serve as the MapReduce job:
      #!/bin/sh
      tr -d "\n" | sed -e "s/\t/ /g" | sed -e "s/<Sql_Text>//g" | sed -e "s/<\/Sql_Text>/\n/g"
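What the streaming mapper accomplishes, sketched with the Python stdlib: grab each `<Sql_Text>` payload, unescape the XML entities, and collapse the statement onto a single line. `extract_sql` is an illustrative stand-in for the `cutlftag.sh` mapper, not the script from the deck.

```python
# Extract every Sql_Text payload from an audit record, unescape XML
# entities, and flatten each statement to one line.
import re
from xml.sax.saxutils import unescape

def extract_sql(audit_xml):
    """Return each Sql_Text payload as one unescaped, single-line statement."""
    sqls = []
    for m in re.finditer(r"<Sql_Text>(.*?)</Sql_Text>", audit_xml, re.S):
        text = unescape(m.group(1), {"&apos;": "'", "&quot;": '"'})
        sqls.append(" ".join(text.split()))  # collapse newlines/tabs/spaces
    return sqls

record = ("<AuditRecord><Audit_Type>1</Audit_Type>"
          "<Sql_Text>select *\nfrom lineitem\n"
          "where l_shipdate &lt;= date &apos;1998-12-01&apos;</Sql_Text>"
          "</AuditRecord>")
print(extract_sql(record))
# → ["select * from lineitem where l_shipdate <= date '1998-12-01'"]
```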
  19. Can machine learning find the suspicious SQL?
  20. Classification with Mahout. The steps:
      1. Classify the training data by hand
      2. Convert it to sequence files
      3. Convert those to vector data
      4. Split into training and test sets
      5. Train a model (train)
      6. Test the model
      This time the classifier is Naive Bayes (the one spam filters often use!)
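The six steps above, shrunk to a toy: a multinomial Naive Bayes over whitespace-split SQL tokens (the same analyzer choice as the Mahout run). The three training statements are illustrative stand-ins, not the corpus from the deck.

```python
# Toy multinomial Naive Bayes: count tokens per class, then score a new
# statement by class prior plus smoothed per-token log-likelihoods.
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: (label, sql) pairs -> (token counts, token totals, doc counts)."""
    counts, totals, docs = defaultdict(Counter), Counter(), Counter()
    for label, sql in samples:
        tokens = sql.lower().split()
        counts[label].update(tokens)
        totals[label] += len(tokens)
        docs[label] += 1
    return counts, totals, docs

def classify(model, sql):
    counts, totals, docs = model
    vocab = {t for c in counts.values() for t in c}
    n_docs = sum(docs.values())
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(docs[label] / n_docs)  # class prior
        for tok in sql.lower().split():
            # Laplace smoothing so unseen tokens don't zero out a class
            lp += math.log((counts[label][tok] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("tpcc", "SELECT s_quantity FROM stock WHERE s_i_id = :1 AND s_w_id = :2"),
    ("tpch", "select l_returnflag, sum(l_quantity) from lineitem group by l_returnflag"),
    ("suspicious", "select * from CUSTOMER where C_LAST = 'Hirama'"),
])
print(classify(model, "select * from CUSTOMER where C_LAST = 'Smith'"))
# → suspicious
```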
  21. Classifying and converting the data.
      • Classify the training data by hand: directories tpch, tpcc, and suspicious under trainSql.
      • Convert to sequence files:
      $ mahout seqdirectory -i trainSql -o trainSeq
      The contents look like this:
      Key: /tpcc/part-00000: Value:
      SELECT /* N-07 */ s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_data FROM stock WHERE s_i_id = :1 AND s_w_id = :2 FOR UPDATE
      UPDATE /* N-08 */ stock SET s_quantity = :1 , s_ytd = s_ytd + :2 , s_order_cnt = s_order_cnt + 1, s_remote_cnt = s_remote_cnt + :3 WHERE s_i_id = :4 AND s_w_id = :5
      INSERT /* N-09 */ INTO order_line (ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_delivery_d, ol_quantity, ol_amount, ol_dist_info) VALUES (:1 , :2 , :3 , :4 , :5 , :6 , NULL, :7 , :8 , :9 )
      • Convert to vector data:
      $ mahout seq2sparse -i trainSeq -o trainSparse \
        -a org.apache.lucene.analysis.WhitespaceAnalyzer
      The contents look like this (TF-IDF weights keyed by term id):
      Key: /tpcc/part-00001: Value: {543:26.124736785888672,542:36.76076126098633,541:51.987571716308594,539:116.10087585449219,538:82.09571075439453,529:82.09571075439453,528:13.946792602539062,527:13.946792602539062,524:25.92806053161621,...}
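What seq2sparse produces, in miniature: TF-IDF weights over whitespace-split tokens. The weight used here is the textbook tf * log(N/df), a simplification of the exact Lucene/Mahout weighting, so the numbers differ from the slide's dump.

```python
# Minimal TF-IDF over whitespace-split tokens: term frequency per document
# times log of (number of documents / documents containing the term).
import math
from collections import Counter

def tfidf(docs):
    """docs: list of strings -> one {token: weight} dict per document."""
    token_lists = [d.lower().split() for d in docs]
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # document frequency per distinct token
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tfidf(["select * from customer",
              "select count(*) from lineitem"])
print(round(vecs[0]["customer"], 3))  # in 1 of 2 docs → 1 * ln 2 ≈ 0.693
print(vecs[0]["select"])              # in every doc → weight 0.0
```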
  22. Building and testing the model.
      • Split into training and test data:
      $ mahout split -i trainSparse/tfidf-vectors --trainingOutput trainData \
        --testOutput trainTestData --randomSelectionPct 50 \
        --overwrite --sequenceFiles --method sequential
      • Train the model:
      $ mahout trainnb -i trainData -o trainModel -li trainIndex -ow -c -el
      • Test the model:
      $ mahout testnb -i trainTestData -o trainTestResult \
        -m trainModel -l trainIndex -ow -c
      Test data:
      TPC-H, 3 statements, e.g.: select s_suppkey, s_name, s_address, s_phone, total_revenue from supplier, revenue0 where s_suppkey = supplier_no and total_revenue = ( select max(total_revenue) from revenue0 ) order by s_suppkey
      TPC-C, 4 statements, e.g.: SELECT /* N-07 */ s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_data FROM stock WHERE s_i_id = :1 AND s_w_id = :2 FOR UPDATE
      Suspicious SQL, 1 statement: SELECT C_ID FROM TPCC.CUSTOMER WHERE C_ID=:B3 AND C_D_ID=:B2 AND C_W_ID=:B1
  23. So, the test results?
      Summary
      -------------------------------------------------------
      Correctly Classified Instances   : 8    100%
      Incorrectly Classified Instances : 0      0%
      Total Classified Instances       : 8
      =======================================================
      Confusion Matrix
      -------------------------------------------------------
      a  b  c  <--Classified as
      1  0  0  | 1  a = suspicious
      0  4  0  | 4  b = tpcc
      0  0  3  | 3  c = tpch
      100% accuracy! So does this let us find the suspicious SQL?
  24. Still a long way from applying this in production: the volume and quality of the training data, an understanding of machine learning, and the skills to operate Mahout.
  25. For DB engineers to get the most out of Hadoop:
      1. Approach it as "digging up buried rough diamonds"
      2. The right tool for the job: complement the RDBMS, don't replace it
      3. In machine learning, the data is everything
      This is where DB engineers can show their skill.
  26. • Reproduction without permission is prohibited.
      • This document is for reference only; the information in it is subject to change without notice.
      • Insight Technology, Inc. makes no warranty regarding the contents of this document and accepts no liability for any damages related to them.
      • Product and service names mentioned in this document are trademarks or registered trademarks of their respective owners.
