Db2 Warehouse Spark Usage Guide: Data Operations
IBM Systems Engineering
2017/10/11
© 2017 IBM Corporation

Contents
§ Db2 Warehouse and Spark
§ Setting up the Spark processing environment on Db2 Warehouse
§ Processing data with Db2 Warehouse and Spark

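What is Db2 Warehouse?
Db2 Warehouse is a new Db2 solution focused on building an analytics environment quickly and easily. It can run on premises, in a private cloud, or on any IaaS.

On premises / private cloud
  Db2
  • Core database software
  • Supports a wide range of requirements and configurations
  • Managed freely by the customer
  Db2 Warehouse (the subject of this guide)
  • A solution for analytics
  • Deployed and used on the customer's own infrastructure
  • Quick and easy to start using thanks to Docker
  • Managed freely by the customer
  • Integrates a Spark data processing environment
Public cloud
  Db2 on Cloud
  • A fully managed DB service for transactional workloads
  • Offered on small virtual machines or large physical machines
  Db2 Warehouse on Cloud
  • A fully managed DB service for analytics
  • SMP and MPP
  • Available on SoftLayer/AWS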

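Db2 Warehouse and Spark
• Integrating the Spark analytics engine into the data warehouse lets a single platform handle both the transformation and analysis of semi-structured and unstructured data and conventional structured-data analytics.
• Db2 Warehouse includes a Spark runtime inside its container, so Spark's distributed data processing can be used from Scala, Python, R, and other languages.
• The Spark environment is created automatically when Db2 Warehouse is deployed, so there is no need to build it separately.
[Figure: the Db2 Warehouse container (also labeled "dashDB Local container" in the figure) holds a Relational Engine and an Analytics Engine that exchange data at high speed through shared memory, on top of a scalable cluster file system. Sources: structured data; semi-structured and unstructured data (CSV, Twitter, geographic data, open data, text files); object storage; Cloudant (accumulation); stream data from Watson IoT/Kafka (collection, extraction). Consumers: BI/analytics applications (SPSS/Cognos) through the SQL interface, the web console, and data scientists who transform, visualize, and analyze the data.]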

Setting up the Spark processing environment on Db2 Warehouse

§ Spark options at Db2 Warehouse setup
§ How to use Spark on Db2 Warehouse
§ Preparing the Python environment
§ Installing Jupyter Notebook (interactive development environment)

Spark options at Db2 Warehouse setup
§ Spark is enabled or disabled when the Db2 Warehouse container is created (when docker run is executed).
    [root@node1i:/root]# docker run -d -it --privileged=true --net=host --name=dashDB \
        -v /mnt/clusterfs:/mnt/bludata0 -v /mnt/clusterfs:/mnt/blumeta0 \
        -e DISABLE_SPARK='NO' -e TIMEZONE='Asia/Tokyo' ibmdashdb/local:latest-linux
  For the list of options that can be specified with the docker run command, see:
  Configuration options for the IBM Db2 Warehouse image
  https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.doc/admin/local_configuring.html#local_configuring
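  Note: Spark is enabled by default, so docker run can also be executed without the DISABLE_SPARK option.
  Even if DISABLE_SPARK='YES' was specified at initial creation, it can be changed later. For an environment that has already been deployed, removing the container and then running docker run -e option=value overrides the settings without deleting the data. (Most items can be changed, except ENABLE_ORACLE_COMPATIBILITY and TABLE_ORG.)
  https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.doc/admin/configuring_Local.html
§ The memory reserved for Spark depends on the amount of memory in the host system.
    System memory              Spark application memory
    < 128 GB                   10%
    >= 128 GB and < 256 GB     15%
    >= 256 GB                  20%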
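The memory tiers for Spark described above can be sketched as a small lookup. This is only an illustration of the table, not something Db2 Warehouse exposes: the function names are hypothetical, and the sketch assumes that a host with exactly 128 GB falls into the 15% tier.

```python
def spark_memory_fraction(host_memory_gb):
    """Return the share of host memory reserved for Spark, following the table.

    Assumption: the boundaries are inclusive on the lower side
    (128 GB -> 15%, 256 GB -> 20%).
    """
    if host_memory_gb < 128:
        return 0.10
    if host_memory_gb < 256:
        return 0.15
    return 0.20


def spark_memory_gb(host_memory_gb):
    """Return the approximate amount of memory (in GB) reserved for Spark."""
    return host_memory_gb * spark_memory_fraction(host_memory_gb)


if __name__ == "__main__":
    for gb in (64, 128, 512):
        print(gb, "GB host ->", round(spark_memory_gb(gb), 1), "GB for Spark")
```

A lookup like this can help when sizing a host: doubling the memory from 128 GB to 256 GB raises the Spark share from 15% to 20%, so the absolute amount available to Spark grows faster than the host memory.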

How to use Spark on Db2 Warehouse (1/3)
§ IDAX.SPARK_SUBMIT
  - A stored procedure that can be called through the SQL interface
  • Syntax
      CALL IDAX.SPARK_SUBMIT(out submission_id varchar(1024)
                           , in parameters varchar(32672)
                           , in configuration varchar(32672) default null)
    submission_id: the submission ID of the Spark application. It is assigned automatically.
    parameters: the Spark application resource (JAR, py, R file, and so on), the main class, arguments, and so on.
    configuration: the parameter format (json, pipe, auto), the mode (sync/async), the number of retries, and so on, in key=value form. (Optional)
  • Example with pipe (|) delimited parameters:
      CALL IDAX.SPARK_SUBMIT(?, 'appResource=idax_examples.jar | mainClass=com.ibm.idax.spark.examples.ReadWriteExampleKMeans')
  • Example with parameters in JSON (JavaScript Object Notation) format:
      CALL IDAX.SPARK_SUBMIT(?, '{ "appResource" : "idax_examples.jar", "mainClass" : "com.ibm.idax.spark.examples.ReadWriteExampleKMeans" }');
  For a concrete walkthrough and for how to monitor running jobs, see "Tutorial" → "Requesting Spark processing through SQL" in this material.
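The two parameter formats accepted by IDAX.SPARK_SUBMIT (pipe-delimited and JSON) can be generated from the same inputs. The following is a minimal sketch that only builds the CALL statement text; the helper names are hypothetical, and actually executing the statement (through a Db2 client or driver of your choice) is outside this sketch.

```python
import json


def pipe_parameters(app_resource, main_class):
    """Build the pipe-delimited form of the 'parameters' argument."""
    return "appResource={} | mainClass={}".format(app_resource, main_class)


def json_parameters(app_resource, main_class):
    """Build the JSON form of the 'parameters' argument."""
    return json.dumps({"appResource": app_resource, "mainClass": main_class})


def spark_submit_call(parameters):
    """Wrap a parameter string in a CALL statement.

    Single quotes are doubled so that the value stays a valid SQL string literal.
    """
    return "CALL IDAX.SPARK_SUBMIT(?, '{}')".format(parameters.replace("'", "''"))


if __name__ == "__main__":
    jar = "idax_examples.jar"
    cls = "com.ibm.idax.spark.examples.ReadWriteExampleKMeans"
    print(spark_submit_call(pipe_parameters(jar, cls)))
    print(spark_submit_call(json_parameters(jar, cls)))
```

The JSON form is the safer choice when an argument may itself contain a pipe character, because the delimiter cannot be confused with the data.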

How to use Spark on Db2 Warehouse (2/3)
§ spark-submit.sh
  - Runs and manages Scala, Java, R, and Python applications from any environment that can run shell scripts
>>-spark-submit.sh---------------------------------------------->
>--+-+-file_name--+-----------+--| application options |--------------------------------------------+-+------------+-+-><
| | '-arguments-' | '- --jsonout-' |
| +- --load-samples------------------------------------------------------------------------------+ |
| | .-apps--------. | |
| +- --upload-file--+-------------+--source_path--+--------------------+-------------------------+ |
| | +-defaultlibs-+ '- --user--user_name-' | |
| | '-globallibs--' | |
| | .-apps--------. | |
| +- --download-file--+-------------+--file_name--+--------------------+--+--------------------+-+ |
| | +-defaultlibs-+ '- --user--user_name-' '- --dir--target_dir-' | |
| | .-apps--------. | |
| +- --list-files--+-------------+--+--------------------+---------------------------------------+ |
| | .-apps--------. | |
| +- --delete-file--+-------------+--path--+--------------------+--------------------------------+ |
| +- --cluster-status----------------------------------------------------------------------------+ |
| +- --app-status--submission_ID-----------------------------------------------------------------+ |
| +- --list-apps---------------------------------------------------------------------------------+ |
| +- --download-cluster-logs--+--------------------+---------------------------------------------+ |
| | '- --dir--target_dir-' | |
| +- --download-app-logs--+---------------+--+--------------------+------------------------------+ |
| | '-submission_ID-' '- --dir--target_dir-' | |
| '- --kill--submission_ID-----------------------------------------------------------------------' |
'-+- --display-cluster-log--+-out-+--+-master-------------+-+-----------------------------------------------------'
| '-err-' '-worker--IP_address-' |
+- --display-app-log--+-app--+--+---------------+---------+
| +-out--+ '-submission_ID-' |
| +-err--+ |
| '-info-' |
+- --webui-url--------------------------------------------+
+- --env--------------------------------------------------+
+- --version----------------------------------------------+
'- --help-------------------------------------------------'
• Syntax (main diagram above, "application options" fragment below)
application options
|--+- --class--main_class--+------------------------+-+--------->
| | .-,---------. | |
| | V | | |
| '- --jars----file_name-+-' |
'-+----------------------------+-------------------'
| .-,---------. |
| V | |
'- --py-files----file_name-+-'
.- --name--application_id-. .- --loc--host---.
>--+-------------------------+--+----------------+-------------->
'- --name--name-----------' '- --loc--client-'
>--+---------------------------------------------+--------------|
'- --master--+-https://--dashDB_host--:8443-+-'
'-local------------------------'
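Annotations on the syntax diagram:
• file_name, arguments, application options: specify the application to run
• --load-samples, --upload-file, --download-file, --list-files, --delete-file: file operations
• --cluster-status, --app-status, --list-apps: status checks
• --download-cluster-logs, --download-app-logs, --display-cluster-log, --display-app-log: log checks
• --kill: stop processing
• --webui-url: display the Spark Web User Interface (UI)
For a concrete walkthrough and execution examples, see "Tutorial" → "Preparing the REST tools" through "Running the application" in this material.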

How to use Spark on Db2 Warehouse (3/3) (1)
§ REST API
  - Can be called from cURL and other REST client tools
✓ Database API  [BASE URL: /dashDB-api, API VERSION: 1.1.0]
  • global: operations on the global administrator directory (/mnt/clusterfs/global). Administrators only.
      GET    /global                       List the contents of the global administrator directory
      GET    /global/{file_or_folder}      Get the contents of a file or list the contents of a folder relative to the global administrator directory
      POST   /global                       Upload a file to the global administrator directory
      POST   /global/{folder}              Upload a file to a folder relative to the global administrator directory
      DELETE /global/{file_or_folder}      Delete a file or a folder in the global administrator directory
  • home: operations on the home directory
      GET    /home                         List the contents of the home directory
      GET    /home/{file_or_folder}        Get the contents of a file or list the contents of a folder relative to the home directory
      POST   /home                         Upload a file to the home directory
      POST   /home/{folder}                Upload a file to a folder relative to the home directory
      DELETE /home/{file_or_folder}        Delete a file or a folder in the home directory
  • load: loading data. The target table name and the source file name (a path on the client that calls the REST API) are specified at run time.
      GET    /load/{loadID}                Get information on load jobs based on loadID
      GET    /load/{tableName}             Get information on load jobs based on tableName
      POST   /load/local/del/{tableName}   Load local delimited data into a table

How to use Spark on Db2 Warehouse (3/3) (2)
  • rscript: running R scripts. The body of the R script is specified at run time.
      POST   /rscript                      Run a temporary R script file
      POST   /rscript/{filename}           Run an existing R script file
  • users: LDAP user operations. Administrators only.
      POST   /users                        Create an LDAP user
✓ Analytics API  [BASE URL: /dashDB-api, API VERSION: 1.1.0]
  • apps: operations on Spark applications
      POST   /public/apps/cancel           Cancel a Spark application
      POST   /public/apps/submit           Submit a Spark application
  • monitoring: monitoring running Spark applications
      GET    /public/monitoring/app_status Check Spark applications that are currently running
  • samples: loading the Spark sample files
      POST   /public/samples/load          Load the Spark samples into the user's home directory
For details of the API commands, see the following. Clicking a command at these URLs shows a detailed description and a template cURL command.
  https://developer.ibm.com/static/site-id/85/api/db2wh/#/
  https://developer.ibm.com/static/site-id/85/api/db2wh/#analytics
Note: Db2 Warehouse uses a self-signed certificate by default. If cURL fails with a certificate error, specify the "-k" option (which tolerates certificate verification errors).
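For clients written in Python rather than cURL, the endpoint URLs and the equivalent of the "-k" option can be sketched with the standard library alone. This sketch sends nothing over the network. Port 8443 is taken from the spark-submit.sh syntax diagram and is assumed here to serve the REST API as well; the host name and function names are hypothetical, and authentication is omitted.

```python
import ssl
from urllib.parse import quote

BASE_PATH = "/dashDB-api"  # base URL stated in the API listing


def api_url(host, path, port=8443):
    """Build the full URL for an API path such as /public/monitoring/app_status."""
    return "https://{}:{}{}{}".format(host, port, BASE_PATH, path)


def home_file_url(host, file_or_folder, port=8443):
    """Build the URL for GET/DELETE /home/{file_or_folder}, escaping the name."""
    return api_url(host, "/home/" + quote(file_or_folder), port)


def insecure_ssl_context():
    """Return an SSL context that skips certificate checks, like curl -k.

    Only appropriate for the default self-signed certificate in test setups.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    host = "db2wh.example.com"  # hypothetical host name
    print(api_url(host, "/public/monitoring/app_status"))
    print(home_file_url(host, "work/my app.py"))
```

The context returned by insecure_ssl_context() could be passed to urllib.request.urlopen(..., context=ctx). Replacing the self-signed certificate is the better long-term fix than disabling verification.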

Preparing the Python environment (1/2)
§ Installing Python packages
  - If necessary, download pip (the Python package management command).
  - From the host OS, log in to the Db2 Warehouse container and run the commands.
    • On Db2 Warehouse, packages are made available to all users (root privileges are required).
    • The commands are run inside the Db2 Warehouse Docker container.
      docker exec -it dashDB bash
      /usr/bin/pip install <package name>              # for Jupyter Notebook
      /usr/local/bin/pip2.7 install <package name>     # for Spark applications
      exit
    Run either one of the install commands, or both, as needed.
  For details of the installation procedure, see:
  https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.doc/learn_how/deploying_python.html
  Note 1: If the Db2 Warehouse Docker container is redeployed, install the Python packages again.
  Note 2: In an MPP environment, install the packages for Spark applications on every node.

Preparing the Python environment (2/2)
§ Managing Python packages
  - Installed packages can be checked with "freeze" or "list".
      docker exec -it dashDB bash
      /usr/local/bin/pip2.7 list --format=columns   # check packages for Spark applications
      /usr/bin/pip list --format=columns            # check packages for Jupyter Notebook
      exit
  • Example output (the installed packages and their versions are displayed)
      [root@node1i:/root]# docker exec -it dashDB bash
      [root@node1i - dashDB /]# /usr/bin/pip list --format=columns
      Package             Version
      ------------------- -----------
      certifi             2017.7.27.1
      chardet             3.0.4
      cycler              0.10.0
      matplotlib          2.0.2
      numpy               1.13.1
      ...(omitted)...

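Installing Jupyter Notebook (interactive development environment) (1/4)
§ The Spark application development container provided for Db2 Warehouse
  - Can be used as a development and execution environment for Spark with Scala and Spark with Python
  - Provided as a Docker container for Db2 Warehouse, based on open source Jupyter
  - The Spark driver and executors run on Db2 Warehouse, not in the Notebook container
§ What is Jupyter Notebook?
  - A web-based interactive development environment that supports many programming languages and is widely used in data science
  - Source code, formulas, figures, and explanatory text can be managed and shared together
  - Code can produce output in rich formats such as LaTeX and JavaScript
[Figure: a host server (Linux, for example) runs the Db2 Warehouse container and the Jupyter Notebook container side by side. The Jupyter Notebook container is built separately from the Db2 Warehouse container, while Spark itself runs on Db2 Warehouse, in the same container as the DB engine.]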

§ How the Jupyter Notebook container is deployed
  - Create the container on the host where Db2 Warehouse is installed (SMP) or on the host of the head node (MPP).
  - To keep notebook files outside the Jupyter Notebook container, use docker run -v to map the home directory of jovyan (the Jupyter user) to the home directory of a Db2 Warehouse user.
    • Without an external volume, notebook files are reset when the container restarts.
    → This document shows the procedure for a user named bluuser1 with an external volume.
§ Jupyter Notebook container installation procedure
  1. Prerequisites
     • A Db2 Warehouse user for Jupyter has already been created (in the console: "Settings" → "Users and Privileges" → "+" ("Add User")).
     • A Git client is installed (the git command is available).
     • When running as the root user, umask is 0022 (check with the output of the "umask" command).
  2. Download the repository
       cd <any directory>
       git clone https://github.com/ibmdbanalytics/dashdb_analytic_tools.git
     The following subdirectory is created under the current directory:
       dashdb_analytic_tools/dashdblocal_notebooks

  3. Check where the home directory is mounted
       docker inspect dashDB | grep -B 1 'Destination.*/blumeta0'
     [Example]
       [root@node1i:/root]# docker inspect dashDB | grep -B 1 'Destination.*/blumeta0'
           "Source": "/mnt/clusterfs",
           "Destination": "/mnt/blumeta0",
     This shows that the home directory uses /mnt/clusterfs.
  4. Check the user ID (UID) inside the Db2 Warehouse container
       docker exec -t dashDB /usr/bin/id -u bluuser1
     [Example]
       [root@node1i:/root]# docker exec -t dashDB /usr/bin/id -u bluuser1
       5003
     This shows that the UID of bluuser1 is 5003.
  5. Start the Jupyter Notebook container in the foreground
     (For the first run, "-it --rm" (foreground start) is recommended. The container can be stopped with Ctrl+C.)
       docker run -v /mnt/clusterfs/home/bluuser1/work:/home/jovyan/work -e NB_UID=5003 --user=root \
           -e DASHDBUSER=bluuser1 -e DASHDBPASS=<password of bluuser1> -it --rm --net=host \
           dashdblocal_notebook
     For DASHDBPASS, specify the password that was set when bluuser1 was created.

  6. Start the Jupyter Notebook container in the background
     Change the "-it --rm" command options to "-d".
       docker run -v /mnt/clusterfs/home/bluuser1/work:/home/jovyan/work -e NB_UID=5003 --user=root \
           -e DASHDBUSER=bluuser1 -e DASHDBPASS=<password of bluuser1> -d --net=host \
           dashdblocal_notebook
  7. Access the Jupyter Notebook container from a browser
       http://<IP address of the host OS>:8888
     Connect as the user specified when the container was created, using the password that was set when bluuser1 was created.
  Note: When a PC client accesses the Jupyter Notebook container through a firewall, set up SSH port forwarding as needed (for example, forward port 8888 of the Jupyter Notebook server to port 19999 on the PC client).
  For details of the installation procedure, including multi-user support, changing the port, and deployment on PowerPC, see:
  https://github.com/ibmdbanalytics/dashdb_analytic_tools/tree/master/dashdblocal_notebooks

Processing data with Db2 Warehouse and Spark

§ Data distribution in Db2 Warehouse
§ Processing Db2 data with Spark (reading data)
§ Loading data processed in Spark into Db2 (writing data)
§ Parallel data processing
§ Offloading processing to Db2

Data distribution in Db2 Warehouse
• Data on the RDB side is organized into data partitions, and each data partition holds 1/n of the data.
• The number of data partitions is 24 or 60 (1 on a single server).
[Figure: a query is issued to 24 db2sysc processes (db2sysc 0 to 23), one for each of the 24 data partitions (Partition 0 to 23). Each partition processes 1/24 of the data for the query, and each process reserves its own memory. The processes can be placed on any number of servers (4 in the example). The 24 sets of DB files are placed on shared disk storage.]
照会を発⾏

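How structured data is distributed inside Db2
• The hash value of the "distribution key", which is chosen for each table, determines the data partition in which a record is stored.
• The distribution key is specified when the table is created:
    create table sales (
      store_id        bigint,
      order_date      timestamp,
      shipping_id     bigint,
      shipping_method char(20),
      mix_cntl        int,
      mix_desc        char(20),
      mix_chr         char(9),
      mix_ints        smallint,
      mix_tmstmp      timestamp )
    distribute by hash (store_id)
• The placement is decided from the record's values at insert time: the distribution key is extracted, its hash value is computed, and the destination partition is chosen from the hash value.
[Figure: records 1, 2, and 3 yield the hash values a8db4f, c8cbd1, and 11ed8f and are stored in different partitions (Partition 16, Partition 5, and Partition 3 appear in the figure).]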
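The idea of choosing a partition from the hash of the distribution key can be sketched in a few lines. This is an illustration of the principle only: MD5 is used as a stand-in hash and the modulo mapping is an assumption, so the partition numbers will not match what Db2 actually computes. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions=24):
    """Map a distribution-key value to a partition number.

    Stand-in logic: hash the key, then take the remainder by the partition count.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def distribution(keys, num_partitions=24):
    """Count how many keys land in each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


if __name__ == "__main__":
    store_ids = range(1, 10001)
    counts = distribution(store_ids)
    print("partitions used:", len(counts))
    print("smallest / largest partition:", min(counts.values()), max(counts.values()))
```

Two consequences follow from this scheme: the same key value always lands in the same partition, and a key with few distinct values (or heavily skewed values) spreads the rows unevenly, which is why the choice of distribution key matters.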

Should data be placed in Db2 or on shared storage?
• Data can be kept in Db2, or kept as files on the shared file system.
• Decide where to keep it based on the characteristics of the data and how it will be used.
RDB is a good fit when:
  • the data format is reasonably fixed
  • individual records need to be fetched quickly without scanning everything
  • the data needs to be summarized heavily with SQL
The shared file system is a good fit when:
  • the data keeps changing, so its format should not be fixed
  • processing it directly with Python or similar tools is more convenient
  • files just need to be dropped in and accumulated for now
[Figure: sample CSV rows, a JSON/XML document, and TensorFlow log lines stored as files on the shared file system, next to RDB tables with columns C1 to C8.]

Processing Db2 data with Spark (reading data)

Loading Db2 data into Spark (1/2)
Flow for counting the records in a table that match a condition, using Spark (Python)
    # Import the PySpark library
    from pyspark.sql import SparkSession
    # Create a session object from the imported SparkSession
    sparkSession = SparkSession.builder.getOrCreate()
    # Define the table to access, including its schema name
    # (change this to a table of your own when running)
    table = "TUKI.SPARK_TEST_DATA"
    # Define a DataFrame with the SparkSession reader (definition only, nothing runs yet).
    # The format string indicates that this is an access to Db2;
    # the filter condition (the equivalent of WHERE) goes in options.
    df = sparkSession.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table, sqlpredicate="ID < 10") \
        .load()
    # Count the matching rows with the DataFrame count() method
    # (this is where the processing actually runs)
    resultCount = df.count()
    # Print the resulting count
    print(resultCount)
When count() runs, SQL like the following is issued to Db2:
    SELECT count_big(*) FROM TUKI.SPARK_TEST_DATA
    WHERE (dbpartitionnum("ID") = ? selectivity 0.000000000000000001) AND (ID < 10)

Loading Db2 data into Spark (2/2)
Example of running the count in the interactive development environment (Jupyter Notebook)

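Data processing with Spark DataFrames
Running SELECT-equivalent operations
    # Define the DataFrame as in the previous example
    table1 = 'TUKI.SPARK_T7'
    df = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table1, sqlpredicate="C1 < 500") \
        .load()
    print("Count")
    print(df.count())
    # show() prints the DataFrame contents itself and returns None,
    # so it is called directly rather than wrapped in print()
    print("Simple SELECT showing the first 20 rows")
    df.show()
    print("Simple SELECT with an explicit number of rows")
    df.show(50)
    # Methods can be chained as with ordinary Python objects:
    # select picks the columns and orderBy sorts the data
    # (SPARK_T7 has columns C1 to C6, so C1 is used as the identifier column)
    print("Select columns and get sorted records")
    resultOrderBy = df.select("C1", "C3").orderBy("C3")
    resultOrderBy.show()
    print("Equivalent of DISTINCT")
    df.select("C2").distinct().show()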

Running SELECT-equivalent operations
    # Define the DataFrame as in the previous example
    table1 = 'TUKI.SPARK_T7'
    df = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table1, sqlpredicate="C1 < 500") \
        .load()
    print("Get the column layout of the DataFrame")
    print(df.columns)
    print("Get the schema of the DataFrame")
    df.printSchema()
Output: column layout of the DataFrame (the names of the columns that make up the DataFrame)
    ['C1', 'C2', 'C3', 'C4', 'C5', 'C6']
Output: schema of the DataFrame (column names, data types, and nullability as loaded from Db2)
    root
     |-- C1: long (nullable = true)
     |-- C2: integer (nullable = true)
     |-- C3: integer (nullable = true)
     |-- C4: string (nullable = true)
     |-- C5: timestamp (nullable = true)
     |-- C6: string (nullable = true)

Running JOIN-equivalent operations (1/2)
    # Define DataFrames for the two tables
    table1 = 'TUKI.SPARK2_T1'
    table2 = 'TUKI.SPARK2_T2'
    df1 = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table1).load()
    df2 = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table2).load()
    print("Inner join")
    # No SQL is issued when df_joined is defined; the data is fetched when show() runs
    df_joined = df1.join(df2, df1["id"] == df2["id"], "inner")
    df_joined.show()
How this join specification behaves
  • df2 is joined to df1, with df1 as the base.
  • The ID column is used as the join condition.
  • The "inner" option specifies an inner join, so only records with matching IDs are output.
Output:
    Inner join
    +---+--------------------+-------+---+--------------------+-------+
    | ID|                  C2|     C3| ID|                  C2|     C3|
    +---+--------------------+-------+---+--------------------+-------+
    |  1|t1 c2 record        |  39398|  1|t2 c2 record        |3789398|
    |  3|t1 c2 record        |    398|  3|t2 c2 record        |      8|
    |  2|t1 c2 record        |3649398|  2|t2 c2 record        |  19398|
    +---+--------------------+-------+---+--------------------+-------+

Running JOIN-equivalent operations (2/2)
    # (continued from the previous page)
    print("Left outer join")
    df_joined = df1.join(df2, df1["id"] == df2["id"], "left_outer")
    df_joined.show()
  The "left_outer" option specifies a left outer join with df1 as the base. Rows that exist in df1 are output even when df2 has no matching record.
Output (excerpt: the unmatched row):
    Left outer join
    +---+--------------------+-------+----+--------------------+-------+
    | ID|                  C2|     C3|  ID|                  C2|     C3|
    +---+--------------------+-------+----+--------------------+-------+
    |  4|t1 c2 record        |    248|null|                null|   null|
    +---+--------------------+-------+----+--------------------+-------+

    print("Full outer join")
    df_joined = df1.join(df2, df1["id"] == df2["id"], "full_outer")
    df_joined.show()
  The "full_outer" option specifies a full outer join. Rows are output even when the other DataFrame (df1 or df2) has no corresponding record.
Output (excerpt: the unmatched rows):
    Full outer join
    +----+--------------------+-------+----+--------------------+-------+
    |  ID|                  C2|     C3|  ID|                  C2|     C3|
    +----+--------------------+-------+----+--------------------+-------+
    |null|                null|   null|   5|t2 c2 record        |    948|
    |   4|t1 c2 record        |    248|null|                null|   null|
    +----+--------------------+-------+----+--------------------+-------+
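The difference between the three join types can be reproduced without Spark. The sketch below uses the ID and C3 values that appear in the outputs of this section (IDs 1 to 4 in the first table, IDs 1, 2, 3, and 5 in the second) and plain dictionaries. It assumes IDs are unique within each table, and the function name is hypothetical.

```python
# ID -> C3 for each table, taken from the join outputs in this section
t1 = {1: 39398, 2: 3649398, 3: 398, 4: 248}
t2 = {1: 3789398, 2: 19398, 3: 8, 5: 948}


def join(left, right, how="inner"):
    """Join two ID -> value mappings.

    Returns rows of (left_id, left_value, right_id, right_value);
    the missing side is filled with None, as Spark fills it with null.
    """
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left_outer":
        keys = list(left)
    elif how == "full_outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError("unsupported join type: " + how)
    return [
        (k if k in left else None, left.get(k),
         k if k in right else None, right.get(k))
        for k in keys
    ]


if __name__ == "__main__":
    for how in ("inner", "left_outer", "full_outer"):
        print(how)
        for row in join(t1, t2, how):
            print("  ", row)
```

ID 4 exists only in the first table and ID 5 only in the second, so the inner join returns three rows, the left outer join adds the row for ID 4, and the full outer join adds the rows for both ID 4 and ID 5.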

Loading data processed in Spark into Db2 (writing data)

Loading data processed in Spark into Db2 (CSV) (1/2)
Prepare a file
    [root@node1i:/mnt/clusterfs/home/bluuser1/work/nog]# cat data1.txt
    100,A
    200,B
Prepare a table in Db2
Run the following code in the notebook (it reads the file and inserts the data into Db2)
    # inferSchema=True derives the schema automatically from the data
    data1 = spark.read.csv("work/nog/data1.txt", mode="DROPMALFORMED", inferSchema=True)
    data1.write.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="BLUADMIN.NOGT1") \
        .option("allowAppend", "true") \
        .mode("append") \
        .save()
Key points of the basic flow
  (1) Read the CSV file as a Spark DataFrame.
  (2) Write the Spark DataFrame to Db2 as it is.
  Normally, data transformation is performed between (1) and (2).

Loading data processed in Spark into Db2 (CSV) (2/2)
Confirm that the data was inserted

Loading data processed in Spark into Db2 (JSON) (1/3)
Prepare a file
    [root@node1i:/mnt/clusterfs/home/bluuser1/work/nog]# cat sample1.json
    {"name":"hoge","age":30,"city":"tokyo"}
    {"name":"huga","age":41,"city":"chiba"}
    {"name":"hige","age":26,"city":"kanagawa"}
Prepare a table in Db2
Writes to Db2 are made from a Spark DataFrame, so the goal is to read the JSON data and end up with a DataFrame.

Run the following code in the notebook (spark.read.json reads a JSON file as a Spark DataFrame)
    sj1 = spark.read.json('work/nog/sample1.json')
When reading a file from the notebook, specify a path relative to "/mnt/clusterfs/home/<user name>".
The result can be checked to confirm that the file was read as a DataFrame.

To load the data into Db2 as it is (without transformation), run the following in the notebook, as in the CSV case.
    sj1.write.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="BLUADMIN.NOGT2") \
        .mode("append") \
        .save()
Confirm that the data was inserted

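Example of using Spark to Db2 from an application: MLlib
This example assumes the following use case.
・Pass or fail is decided from the math and English scores (not simply from the total score).
・The pass criteria are modeled on the results of eight past students.
・A CSV file containing name, math score, and English score is read, the model decides pass or fail, and the result is written to a Db2 table together with the input.
1. Create the model
1.1 Read the results file for the eight students
    training = spark.read.csv('work/nog/score.csv', header=True, inferSchema=True)
Point
・Pass/fail is defined in a column named "label" (required by the Pipeline covered on the next page).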

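1.2 Create the model with a Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml import Pipeline
    # Feature extraction
    assembler = VectorAssembler(inputCols=["Math", "English"], outputCol="features")
    # Configure the learner
    lr = LogisticRegression(maxIter=10)
    # Register the flow of feature extraction and learning algorithm as a Pipeline
    pipeline = Pipeline(stages=[assembler, lr])
    # Train and generate the model
    model = pipeline.fit(training)
Points
・This is a pass/fail decision (0 or 1), so logistic regression is used.
・In Spark, describing data transformation with a Pipeline is the recommended approach, so a Pipeline is used.
・A Pipeline first needs feature extraction. Here the explanatory variables are the math and English scores, that is, several numeric values used together as features, so VectorAssembler is used.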
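What the two Pipeline stages do to a single row can be sketched in plain Python: the assembler gathers the numeric columns into one feature list, and logistic regression turns a weighted sum of those features into a probability and a 0/1 label. The weights and intercept below are hypothetical values chosen for illustration, not values learned from the score file, and the function names are not part of Spark.

```python
import math

# Hypothetical coefficients; a trained model would learn these from the data
HYPOTHETICAL_WEIGHTS = [0.05, 0.05]   # one weight per feature (Math, English)
HYPOTHETICAL_INTERCEPT = -7.0


def assemble(row, input_cols):
    """Gather several numeric columns into one feature list (the VectorAssembler step)."""
    return [float(row[c]) for c in input_cols]


def predict_probability(features, weights, intercept):
    """Logistic function applied to the weighted sum of the features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, intercept, threshold=0.5):
    """Return 1 (pass) when the probability exceeds the threshold, else 0 (fail)."""
    return 1 if predict_probability(features, weights, intercept) > threshold else 0


if __name__ == "__main__":
    for row in ({"id": "i", "Math": 90, "English": 95},
                {"id": "j", "Math": 70, "English": 65}):
        features = assemble(row, ["Math", "English"])
        p = predict_probability(features, HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_INTERCEPT)
        label = predict_label(features, HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_INTERCEPT)
        print(row["id"], round(p, 3), label)
```

Because each subject has its own weight, a model of this form can decide pass or fail differently from a simple total-score cutoff, which is the point of the use case.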

1.3 Test the trained model
    test = spark.createDataFrame([
        ("i", 90, 95),   # i will probably pass with these scores
        ("j", 70, 65)    # j has borderline scores
    ], ["id", "Math", "English"])
    # Compute the pass/fail prediction
    predict = model.transform(test)
    # Show part of the prediction result
    predict.select('id', 'Math', 'English', 'probability', 'prediction').show(truncate=False)

2. Save the trained model
    pipeline.write().overwrite().save('work/nog/score_pipeline')
    model.write().overwrite().save('work/nog/score_model')
Point
・The Pipeline itself can also be saved.

3. Create a table in Db2

4. Score new data and store the results in Db2
4.1 Create the data
    # cat new_score.csv
id,Math,English
k,40,80
l,60,40
m,55,59
n,100,10
o,20,80
p,67,59
q,40,87
r,76,56
s,20,78
t,70,55
u,34,80

4.2 Load the model saved in step 2 and the data from step 4.1
    from pyspark.ml import PipelineModel
    from pyspark.ml import Pipeline
    pipe1 = Pipeline.load('work/nog/score_pipeline')
    model1 = PipelineModel.load('work/nog/score_model')
    new_score = spark.read.csv('work/nog/new_score.csv', header=True, inferSchema=True)

4.3 Run scoring and store the results in Db2
    # Run scoring with model.transform
    predict = model1.transform(new_score)
    # Take only the required columns from the scoring result
    toDb = predict.select('id', 'Math', 'English', 'prediction')
    # Write to Db2
    toDb.write.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="BLUADMIN.NOG_EXAM") \
        .mode("append") \
        .save()
4.4 Confirm that the data was inserted
In addition to the ID, the math score (MATH_SCORE), and the English score (ENGLISH_SCORE), the pass/fail prediction produced by scoring (RESULT) has been stored in the database.

Supplement: various kinds of preprocessing, such as the following, can be done in Spark before the data is loaded into Db2.
・Extract only those who passed (adds a filter step to the processing on the previous page)
    predict = model1.transform(new_score)
    toDb_tmp = predict.select('id', 'Math', 'English', 'prediction')
    toDb = toDb_tmp.where(toDb_tmp.prediction == 1)
    toDb.write.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="BLUADMIN.NOG_EXAM") \
        .mode("append") \
        .save()
・Count the number of those who passed
    toDb = toDb_tmp.where(toDb_tmp.prediction == 1).count()

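Parallel data processing

Parallel data processing (retrieving DB data) (1/3)
§ Parallel processing in a Db2 Warehouse SMP environment
  - In an SMP environment, Db2 is created in a single-partition configuration rather than an MPP configuration.
  - As a result, the data is not distributed on the Spark side at the point where it is loaded into a DataFrame.
  - Calling the repartition method distributes the DataFrame explicitly so that it can be processed in parallel.
    # Uses the IDAX data source
    input_df = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="SPARK_TEST_DATA") \
        .load()
    # Example: process in parallel across 10 partitions
    input_df = input_df.repartition(10)
§ In both MPP and SMP, data exchange between the Spark engine and the DB uses inter-process communication and runs at high speed.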

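§ Parallel processing in a Db2 Warehouse MPP environment
  - In an MPP environment, where multiple servers work together, Db2 Warehouse is built to distribute processing across 24 or 60 processes.
  - When DB data is loaded into Spark in an MPP configuration, the DataFrame is partitioned automatically to match the data partitions of Db2 Warehouse (with 24 data partitions, the DataFrame is split into 24 partitions).
  - To take advantage of Spark's distributed processing, it is recommended to keep working with Spark DataFrames for as long as possible.
[Figure: Db2 Warehouse MPP cluster. DB data is distributed according to the hash value of the distribution key column, and a DataFrame partitioned to match the number of data partitions is used.]
§ In both MPP and SMP, data exchange between the Spark engine and the DB uses inter-process communication and runs at high speed.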

• Example of counting the rows of a table in an MPP environment
    # Uses the IDAX data source.
    # Because this is an MPP environment, no repartition call is needed.
    inputData = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="SPARK_TEST_DATA") \
        .load()
    inputData.count()
• Spark execution log (excerpt). The database partitions and the nodes that run them are recognized automatically, and the processing is distributed across the partitions.
    CountRDD:54 - Partition [Database partition 4 stored on host node1i (port 0)] is connected to DB2 member 0
    ...
    MisplacedPartitionChecker:42 - retrieving data of partition [Database partition 1 stored on host node1i (port 0)] from host node1i
    ...
    CountRDD:54 - Count query SELECT count_big(*) FROM BLUADMIN.SPARK_TEST_DATA WHERE (dbpartitionnum("ID") = ? selectivity 0.000000000000000001) /* <OPTGUIDELINES><REGISTRY><OPTION NAME='DB2_SELECTIVITY' VALUE='YES'/></REGISTRY></OPTGUIDELINES> */ returned 38 rows
    CountRDD:54 - Count query SELECT count_big(*) FROM BLUADMIN.SPARK_TEST_DATA WHERE (dbpartitionnum("ID") = ? selectivity 0.000000000000000001) /* <OPTGUIDELINES><REGISTRY><OPTION NAME='DB2_SELECTIVITY' VALUE='YES'/></REGISTRY></OPTGUIDELINES> */ returned 46 rows

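Parallel data processing (CSV files) (1/2)
§ Distributed processing of splittable files (automatic)
  - Plain text and bzip2 files are splittable, so they are processed in parallel automatically.
  • Example of counting the rows of a CSV file (uses SparkContext)
      from pyspark import SparkContext
      # textFile is an instance method, so obtain the active SparkContext first
      sc = SparkContext.getOrCreate()
      data_file = "work/Tsuji/Test01.csv"
      raw_data = sc.textFile(data_file)
      raw_data.count()
  • Spark execution log: the input file is about 400 MB (20 million rows). It is split automatically into 13 tasks that run on 3 nodes, and the executor (the worker for parallel processing) on each node accesses the file.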

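Parallel data processing (CSV files) (2/2)
  • Spark execution log (continued): the processing is distributed across the nodes. The 13 tasks count the rows, and the total is returned to the caller.
Note: gzip files are not splittable, so they are not processed in parallel automatically. Choose an appropriate format (compressed or uncompressed) with the compression ratio and execution speed in mind.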
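The splittability rule stated in this section (plain text and bzip2 are splittable, gzip is not) can be captured in a small pre-flight check that runs before a job is submitted. This is a sketch: the extension-to-format mapping and the function name are assumptions, and formats that this section does not mention are reported as unknown rather than guessed.

```python
import os

# Only the formats named in this section are listed
SPLITTABLE_BY_EXTENSION = {
    ".csv": True,    # plain text
    ".txt": True,    # plain text
    ".bz2": True,    # bzip2
    ".gz": False,    # gzip
}


def is_splittable(path):
    """Return True/False for known formats, or None when the format is unknown."""
    extension = os.path.splitext(path)[1].lower()
    return SPLITTABLE_BY_EXTENSION.get(extension)


if __name__ == "__main__":
    for name in ("work/Tsuji/Test01.csv", "Test01.csv.bz2", "Test01.csv.gz", "Test01.parquet"):
        print(name, "->", is_splittable(name))
```

A file reported as not splittable is read by a single task, so a large gzip input may be worth recompressing as bzip2 or leaving uncompressed before it is processed.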

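Offloading processing to Db2

§ Offloading processing from Spark to Db2
  - The Spark integrated into Db2 Warehouse implements offloading of filter processing to Db2.
  - Offloading takes effect when a condition is specified with sqlpredicate at DataFrame definition time, and when the DataFrame filter method is used.
    table = "TUKI.SPARK_T7"
    # Define the DataFrame with the SparkSession reader
    # (example of specifying a filter condition at DataFrame definition time)
    df = sparkSession.read.format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable=table, sqlpredicate="C1 < 10") \
        .load()
    df.count()
    # Filtering with the filter method is also possible
    df.filter(df['C2'] < 50).show()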

SQL issued when df.count() runs
    member STMT_TEXT
    ------ ----------------------------------------------------------------
         0 SELECT count_big(*) FROM TUKI.SPARK_T7 WHERE (dbpartitionnum("C1") = ? selectivity 0.000000000000000001) AND (C1 < 10)
  • SQL is issued to each of the 24 DB partitions.
  • The sqlpredicate defined when the DataFrame was created is embedded in the SQL.

SQL issued when df.filter(df['C2'] < 50).show() runs
    member STMT_TEXT
    ------ ----------------------------------------------------------------
         0 SELECT "C1","C2","C3","C4","C5","C6" FROM TUKI.SPARK_T7 WHERE (dbpartitionnum("C1") = ? selectivity 0.000000000000000001) AND (C2 IS NOT NULL) AND (C2 < 50) AND (C1 < 10)
  • Both the sqlpredicate defined when the DataFrame was created and the condition from the filter method are embedded in the SQL.
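The shape of the offloaded statement can be reproduced with a small string builder, which makes it easier to see where each condition ends up. This is purely an illustration of the logged SQL text shown in this section, not the connector's implementation; the function name and its parameters are hypothetical.

```python
def pushed_down_sql(table, columns, partition_column, predicates):
    """Compose a per-partition SELECT in the shape seen in the logged statement.

    predicates: conditions in the order they should appear after the
    partition restriction (filter conditions first, then sqlpredicate).
    """
    select_list = ",".join('"{}"'.format(c) for c in columns)
    conditions = [
        '(dbpartitionnum("{}") = ? selectivity 0.000000000000000001)'.format(partition_column)
    ]
    conditions += ["({})".format(p) for p in predicates]
    return "SELECT {} FROM {} WHERE {}".format(select_list, table, " AND ".join(conditions))


if __name__ == "__main__":
    sql = pushed_down_sql(
        table="TUKI.SPARK_T7",
        columns=["C1", "C2", "C3", "C4", "C5", "C6"],
        partition_column="C1",
        predicates=["C2 IS NOT NULL", "C2 < 50", "C1 < 10"],
    )
    print(sql)
```

The partition restriction comes first, so each of the per-partition statements scans only its own data, and the filter conditions are evaluated inside Db2 rather than after the rows have been transferred to Spark.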

Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs
and performance characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM
Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server).
Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your
presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in
your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International
Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta
Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.
