Spark Analytics - Scalable Distributed Processing

Azure Databricks and Azure services that implement Spark
Tsuyoshi Matsuzaki
Cloud Solution Architect, One Commercial Partner, Microsoft Japan

Challenges in Big Data & AI
Siloed technologies
Great for data, but not for AI
Great for AI, but not for data
Customer data
Emails / web pages
Sensor data (IoT)
Video / speech
Click streams
…

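Azure Databricks
Built to fit your needs
Role-based access control (RBAC)
Autoscaling
Live collaboration
Enterprise-grade SLA
Feature-rich notebooks
Simple, fast job scheduling
Flexible integration with the Azure portfolio
Higher productivity
Secure, reliable service
Scale without limits
A fast, easy, and collaborative Apache Spark™-based analytics platform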

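Why Apache Spark?
Architecture: one Driver coordinates multiple Executors, and each Executor runs one or more Tasks
Distributed processing
In-memory processing
Performance
Runs faster than Hadoop thanks to in-memory computing (100x or more in some test cases)
Usable for both batch and real-time data processing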
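The Driver / Executor / Task split can be sketched in plain Python. This is only an analogy under stated assumptions: a thread pool stands in for the executors, and the function names (`split_into_partitions`, `count_words_task`, `run_job`) are hypothetical, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the dataset into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def count_words_task(partition):
    """Executor side: one task processes one partition, entirely in memory."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(partials):
    """Driver side: combine the partial results returned by the tasks."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


def run_job(lines, num_executors=2):
    partitions = split_into_partitions(lines, num_executors)
    # The thread pool stands in for the executors; each map call is one task.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(count_words_task, partitions))
    return merge_counts(partials)


lines = ["spark is fast", "spark is scalable", "hadoop is batch"]
word_counts = run_job(lines)
```

The result does not depend on how many partitions are used, which is the property that lets the same job scale out across more executors.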

Developer productivity
Simple APIs designed for large datasets
Provides more than 100 transformation operations

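Ecosystem
Support for a wide range of data sources, a rich ecosystem of ISV applications, and an active developer community
Supported by default on the major public clouds (AWS, Google, Azure) and by on-premises distributors
Unified engine
A unified framework with high-level libraries for interactive SQL, stream analytics, ML, and graph processing
A single application can combine these kinds of processing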

Apache Spark on Microsoft Azure
ANALYTICS LAYER (ranging from CONTROL to EASE OF USE, with reduced administration)
IaaS clusters
Azure Virtual Machine (VMSS, VNet, etc.): install-based, fully customized infrastructure
Managed clusters
Azure HDInsight: workload-optimized, managed clusters
Azure Databricks: frictionless and optimized Spark clusters
STORAGE LAYER
Azure Data Lake Store
Azure Storage

Azure Databricks
Azure Resource Manager
Creates the workspace, its resources, and a locked resource group
Creates and deletes VMs

Unified Analytics Platform
Databricks Workspace
Collaborative Notebooks, Production Jobs
Databricks Runtime
Databricks Cloud Service
Transactions Indexing
ML Frameworks
Blob Storage
Data Lake Storage
AZURE
DATA SOURCES
Event Hub
IoT Hub
Synapse Analytics
Cosmos DB
Azure Data Factory

Databricks Runtime
# Read configuration
readConfig = {
    "Endpoint": "https://doctorwho.documents.azure.com:443/",
    "Masterkey": "YOUR-KEY-HERE",
    "Database": "DepartureDelays",
    "Collection": "flights_pcoll",
    # Optional: custom query
    "query_custom": "SELECT c.date, c.delay, c.origin, c.destination FROM c WHERE c.origin = 'SEA'",
}
# Connect via azure-cosmosdb-spark to create a Spark DataFrame
flights = (
    spark.read.format("com.microsoft.azure.cosmosdb.spark")
    .options(**readConfig)
    .load()
)
flights.count()
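A small helper can make it explicit which entries of the read configuration are required and which are optional. The sketch below assumes the key names shown on the slide; the function name `build_cosmos_read_config` is hypothetical and is not part of the azure-cosmosdb-spark connector.

```python
def build_cosmos_read_config(endpoint, master_key, database, collection, query=None):
    """Return an options dict using the key names from the slide above."""
    required = {
        "endpoint": endpoint,
        "master_key": master_key,
        "database": database,
        "collection": collection,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} must not be empty")
    config = {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
    }
    if query is not None:
        # Only add the optional custom query when the caller supplies one.
        config["query_custom"] = query
    return config


read_config = build_cosmos_read_config(
    "https://doctorwho.documents.azure.com:443/",
    "YOUR-KEY-HERE",
    "DepartureDelays",
    "flights_pcoll",
    query="SELECT c.date, c.delay FROM c WHERE c.origin = 'SEA'",
)
# In a notebook, this dict would be passed to the reader as .options(**read_config).
```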

Databricks Runtime
# Set up the Blob Storage account access key in the notebook session conf.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")
# Load data from a Synapse Analytics query.
# The chained call is wrapped in parentheses so that it is valid Python.
df = (spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("query", "select x, count(*) as cnt from my_table_in_dw group by x")
    .load())
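The placeholder strings above follow a fixed pattern, so they can be assembled from their parts. The helpers below are a hypothetical sketch (none of these function names come from the connector), and the account, container, and directory values are made up for illustration.

```python
def storage_key_conf_name(account):
    """Name of the session conf entry that holds the Blob Storage access key."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"


def wasbs_temp_dir(container, account, directory):
    """Staging location in the wasbs://<container>@<account>... form used above."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{directory.strip('/')}"


def synapse_read_options(jdbc_url, temp_dir, query):
    """Options dict mirroring the .option(...) calls on the slide."""
    return {
        "url": jdbc_url,
        "tempDir": temp_dir,
        "forwardSparkAzureStorageCredentials": "true",
        "query": query,
    }


# Hypothetical values, for illustration only.
temp_dir = wasbs_temp_dir("staging", "mystorage", "/tmp/synapse/")
options = synapse_read_options(
    "jdbc:sqlserver://<the-rest-of-the-connection-string>",
    temp_dir,
    "select x, count(*) as cnt from my_table_in_dw group by x",
)
```

In a notebook the dict could be applied in one call with `.options(**options)` instead of repeating `.option(...)`.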

Structured Streaming
Data stream as an unbounded table
New data in the data stream = new rows appended to the table
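The unbounded-table idea can be modeled in a few lines of plain Python. This is a toy sketch, not how Spark implements it: the class name `UnboundedTable` and the `device_id` field are assumptions made for illustration.

```python
class UnboundedTable:
    """Toy model: a table that only ever grows as stream data arrives."""

    def __init__(self):
        self.rows = []
        self.counts = {}  # running result of a "count per device_id" query

    def append_batch(self, new_rows):
        """Each arriving record becomes a new row; the result is updated incrementally."""
        for row in new_rows:
            self.rows.append(row)
            key = row["device_id"]
            self.counts[key] = self.counts.get(key, 0) + 1
        return dict(self.counts)


table = UnboundedTable()
first = table.append_batch([{"device_id": "a"}, {"device_id": "b"}])
second = table.append_batch([{"device_id": "a"}])
```

The query result after the second batch equals what a batch query over all three rows would return, which is why the same query can be written once for both cases.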

Common programming style
The streaming and batch versions differ only in how data is read and written; the transformation code in between is identical.
# --- Streaming ---
from pyspark.sql.functions import window
df = (spark.readStream.format("kafka").
    option("kafka.bootstrap.servers", "...").
    option("subscribe", "topic1,topic2").
    option("startingOffsets", "latest").
    load())
df = pipelinemodel.transform(df)
new_df = (df.
    withWatermark("ev_time", "10 minutes").
    groupBy(
        df.device_id,
        window(df.ev_time, "5 minutes")).
    count())
(df.writeStream.
    format("com.databricks.spark.sqldw").
    option("url", "...").
    option("tempDir", "wasbs://... ").
    option("dbTable", "testTable").
    option("checkpointLocation", "/tmp/chk").
    start())
# --- Batch ---
df = (spark.read.format("csv").
    option("header", "true").
    option("nullValue", "NA").
    option("inferSchema", True).
    load("/mnt/flight_weather.csv"))
df = pipelinemodel.transform(df)
new_df = (df.
    withWatermark("ev_time", "10 minutes").
    groupBy(
        df.device_id,
        window(df.ev_time, "5 minutes")).
    count())
(df.write.
    mode("overwrite").
    parquet("/mnt/test"))

Analysis with Structured Streaming (example)
Data source
Apache Kafka (HDInsight)
Initial Stream Processing
Map, Filter, Join, Windowing, …

Data source
Databricks
# Create a streaming DataFrame from Kafka
df = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .option("startingOffsets", "earliest")
    .load())

Data source
Databricks Advanced Analysis
# Watermarking and windowing analysis
from pyspark.sql.functions import window
analyzed_df = (
    df
    .withWatermark("event_time", "10 minutes")  # takes the column name as a string
    .groupBy(
        df.device_id,
        window(df.event_time, "5 minutes"))
    .count()
)
...
# Inferencing
analyzed_df = pipelinemodel.transform(df)
...
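What the 5-minute window and 10-minute watermark do to late events can be traced with a plain-Python model. This is a simplified sketch, not Spark's implementation: event times are integers in minutes, every event is treated as its own micro-batch, and an event is discarded when it is older than the watermark (the latest event time seen so far minus the threshold). The function name `windowed_counts` is hypothetical.

```python
from collections import defaultdict

WINDOW_MINUTES = 5
LATE_THRESHOLD_MINUTES = 10


def windowed_counts(events):
    """events: iterable of (device_id, event_time_in_minutes) in arrival order."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for device_id, event_time in events:
        if max_event_time is not None:
            watermark = max_event_time - LATE_THRESHOLD_MINUTES
            if event_time < watermark:
                # Too late: older than the watermark, so it is discarded.
                dropped.append((device_id, event_time))
                continue
        # Tumbling window: align the event time down to a multiple of 5 minutes.
        window_start = (event_time // WINDOW_MINUTES) * WINDOW_MINUTES
        counts[(device_id, window_start)] += 1
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return dict(counts), dropped


events = [("d1", 1), ("d1", 3), ("d1", 7), ("d1", 21), ("d1", 12), ("d1", 2)]
counts, dropped = windowed_counts(events)
```

In this trace the event at minute 12 arrives after minute 21 but is still within the 10-minute threshold, so it is counted; the event at minute 2 is beyond the threshold and is dropped.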

Data source
Databricks
# Sink and start streaming
(df.writeStream
    .format("com.databricks.spark.sqldw")
    .option("url", "...")
    .option("tempDir", "wasbs://... ")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "testTable")
    .option("checkpointLocation", "/tmp_checkpoint_location")
    .start())
Synapse Analytics

Input stream (Kafka or Event Hub)
Structured Streaming (Databricks)
Output sinks: Synapse Analytics, File, Cosmos DB, Event Hub or Kafka
Consumers: BI, SEMS, BizApp, Function, Grid
Uses: dashboard, logging, transaction, alert or workflow, …

Delta Lake
Streaming
Batch
Updates/Deletes
Transaction log
Parquet files

Delta Lake
Streaming
Batch
Log
Instead of parquet...
CREATE TABLE ...
USING parquet
...
dataframe.write.format("parquet").save("/data")
... simply say delta
CREATE TABLE ...
USING delta
…
dataframe.write.format("delta").save("/data")

Delta Lake – Reliability
Streaming
Batch
Log
Key features
● ACID transactions
● Schema enforcement
● Unified batch & streaming
● Time travel / snapshots
High-quality, reliable data
Always ready for analytics

Delta Lake – Reliability
Reproduce past data
SELECT count(*) FROM events
TIMESTAMP AS OF timestamp
SELECT count(*) FROM events
VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
Roll back after an erroneous write
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF
date_sub(current_date(), 1)
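How a transaction log over data files makes versioned reads possible can be sketched with a toy model. This is not Delta Lake's actual log format or API: the class `ToyDeltaTable`, its methods, and the file names are assumptions for illustration, and the rollback here restores an old snapshot with a new commit rather than using the INSERT statement shown above.

```python
class ToyDeltaTable:
    """Append-only commit log; each commit adds and/or removes data files."""

    def __init__(self):
        self.log = []  # index in this list == table version

    def commit(self, add=(), remove=()):
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Replay the log up to `version` to find the files visible at that version."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)

    def rollback_to(self, version):
        """Commit a new version whose contents equal an older snapshot."""
        current = set(self.snapshot())
        target = set(self.snapshot(version))
        return self.commit(add=target - current, remove=current - target)


table = ToyDeltaTable()
v0 = table.commit(add=["part-0.parquet"])
v1 = table.commit(add=["part-1.parquet"])
# A mistaken write that replaces part-0 with part-2.
v2 = table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
v3 = table.rollback_to(v1)
```

Because the log is never rewritten, the mistaken version remains readable after the rollback, which is what allows both auditing and repair.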

Delta Lake – Performance
Streaming
Batch
Log
Databricks optimized engine
High-performance, scalable queries
Key features
● Indexing
● Compaction
● Data skipping
● Caching
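Data skipping can be illustrated with per-file minimum and maximum values: a range query only needs to open files whose range can overlap the predicate. The sketch below is a plain-Python illustration of that idea, not the Databricks engine; the function names and file contents are made up.

```python
def build_file_stats(files):
    """files: {file_name: [values]} -> {file_name: (min, max)} per-file statistics."""
    return {name: (min(values), max(values)) for name, values in files.items()}


def files_to_scan(stats, lower, upper):
    """Keep only files whose [min, max] range can overlap the query range."""
    return sorted(
        name for name, (lo, hi) in stats.items() if hi >= lower and lo <= upper
    )


files = {
    "part-0.parquet": [1, 5, 9],
    "part-1.parquet": [10, 14, 19],
    "part-2.parquet": [20, 25, 29],
}
stats = build_file_stats(files)
# Query: WHERE value BETWEEN 12 AND 18
selected = files_to_scan(stats, 12, 18)
```

Here two of the three files are skipped without being read; the benefit grows when data is laid out so that each file covers a narrow value range.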

MLflow
Tracking
Record and query experiments: code, data, configuration, results
Projects
Reusable packaging format that runs in any environment
Models
General format for sending models to a variety of deployment tools

MLflow
Managed
https://github.com/tsmatz/azure-databricks-exercise

Apache Spark 3.0
• Catalyst optimizer accelerations
• Pluggable Data Catalog
• Spark Graph
• Dynamic Partition Pruning
• Binary format
• recursiveFileLookup
• pathGlobFilter
http://spark.apache.org/community.html
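The two file-source options in the list above, recursiveFileLookup and pathGlobFilter, control which input files are picked up. The following is a plain-Python emulation of that selection behavior as I understand it (recursive directory traversal, and a glob matched against file names only); it does not call Spark, and `list_input_files` and the directory layout are hypothetical.

```python
import fnmatch
import tempfile
from pathlib import Path


def list_input_files(root, recursive=False, glob_filter=None):
    """Return relative paths of the files that would be selected under `root`."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = [p for p in candidates if p.is_file()]
    if glob_filter is not None:
        # The filter is matched against the file name only, not the full path.
        files = [p for p in files if fnmatch.fnmatch(p.name, glob_filter)]
    return sorted(p.relative_to(root).as_posix() for p in files)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "2020" / "01").mkdir(parents=True)
    (base / "a.parquet").write_text("x")
    (base / "notes.txt").write_text("x")
    (base / "2020" / "01" / "b.parquet").write_text("x")

    # Top-level files only, filtered by name pattern.
    flat = list_input_files(base, glob_filter="*.parquet")
    # Nested directories included as well.
    deep = list_input_files(base, recursive=True, glob_filter="*.parquet")
```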

Analytics & AI is the #1 investment for business
leaders, however they struggle to maximize ROI
80% 55%
From : “Understanding Why Analytics Strategies Fall Short for Some, but Not Others”
https://azure.microsoft.com/en-us/resources/why-analytics-strategies-fall-short-for-some-but-not-others/

Products and services that embed Apache Spark
• Azure Synapse Analytics
• Azure Data Factory - Mapping Data Flow *
• Azure Data Factory - Wrangling Data Flow
• SQL Server 2019 Big Data Cluster
* : uses Azure Databricks

Mapping Data Flow
• Resilient data transformation flows
• Transform at scale
• Code-free
• Operationalized with Data Factory

Products and services that embed Apache Spark
• Azure Cosmos DB
• Azure Synapse Analytics
• Azure Data Factory - Mapping Data Flow *
• Azure Data Factory - Wrangling Data Flow
• SQL Server 2019 Big Data Cluster
* : uses Azure Databricks

Azure Cosmos DB <3’s Azure Synapse
Built-in Jupyter Notebooks
Built-in analytical storage
Spark analytics with Azure Synapse
No redundant ETL required
Performance isolation between transactional and analytical storage
Global distribution and elastic scaling
Global distribution, elastic scale, low latency, intuitive consistency modes, 99.999% SLA
Multi-model
Key-value, Column-family, Document, Graph
SQL, Cassandra, MongoDB, Gremlin, Table API, ETCD
Jupyter Notebooks
Azure Synapse - Spark
Transactional store (rows)
Analytical store (columns)
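The difference between a row-oriented transactional store and a column-oriented analytical store can be shown with a small pivot. This is a conceptual sketch only, assuming every record has the same fields; the sample records and the function name `to_columns` are made up and do not reflect how Cosmos DB stores data internally.

```python
def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Transactional store: one record per row, suited to point reads and writes.
rows = [
    {"id": 1, "origin": "SEA", "delay": 5},
    {"id": 2, "origin": "SFO", "delay": 12},
    {"id": 3, "origin": "SEA", "delay": 7},
]
# Analytical store: one array per field, suited to scans and aggregates.
columns = to_columns(rows)
average_delay = sum(columns["delay"]) / len(columns["delay"])
```

An aggregate such as the average delay touches only one column in the columnar layout, which is why keeping it separate from the row store isolates analytical scans from transactional traffic.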
