20190424 Can You Do Machine Learning with Just SQL? An Introduction to BigQuery ML

Proprietary + Confidential
Can You Do Machine Learning with Just SQL?
An Introduction to BigQuery ML
Aaron Lee
aaronlee@mitac.com.tw

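Aaron Lee (李東霖)
Current positions
● Google Solutions Consultant, MiTAC Information Technology (神通資訊科技)
● Product Manager for Qlik and Sophos
Background
● Google Apps certification
● Google Cloud Platform architect
● PMP (Project Management Professional)
● SAP MM consultant
● Oracle OCP certification
Speaking and teaching experience
專案管理師協會 (project managers' association), Providence University, 前川科技, 毅太科技, Water Resources Agency, National Defense University, E.SUN Bank, MiTAC Information Technology, Toastmasters International, Taoyuan City Tax Bureau, TAITRA, 亞東氣體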

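Why BigQuery ML?
With BigQuery ML you can create and run ML models, and make predictions, using only SQL. SQL users can speed up development with the tools they already have: there is no need to move data and no need to spend time building TensorFlow models. The aim is to make machine learning accessible to more people.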

Objectives
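● Build a model from sample data that predicts whether an e-commerce visitor will place an order
● Use the CREATE MODEL statement to create a binary logistic regression model (yes/no)
● Use the ML.EVALUATE function to evaluate the ML model
● Use the ML.PREDICT function to make predictions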

Always free usage limits
● Storage
  Monthly free usage limit: The first 10 GB per month is free.
  Details: BigQuery ML models and training data stored in BigQuery are included in the storage free tier.
● Queries (analysis)
  Monthly free usage limit: The first 1 TB of query data processed per month is free.
  Details: Queries that use BigQuery ML prediction, inspection, and evaluation functions are included in the analysis free tier. BigQuery ML queries that contain CREATE MODEL statements are not. Flat-rate pricing is also available for high-volume customers that prefer a stable, monthly cost.
● BigQuery ML CREATE MODEL queries
  Monthly free usage limit: The first 10 GB of data processed by queries that contain CREATE MODEL statements per month is free.
  Details: BigQuery ML CREATE MODEL queries are independent of the BigQuery analysis free tier.

Source data: e-commerce visitors and whether they placed an order

1. Create the dataset "bqml_tutorial" (using the new UI)

Choose United States as the data location
On the Create dataset page:
● For Dataset ID, enter bqml_tutorial.
● For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.
● Leave all of the other default settings in place and click Create dataset.

2. Create the model
#standardSQL
-- Train a binary logistic regression model on the Google Analytics sample sessions.
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  -- label: 1 if the session had a transaction, otherwise 0
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  -- training period
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

Model types available in BigQuery ML
● Linear regression: linear_reg
● Binary logistic regression: logistic_reg
● Multiclass logistic regression: logistic_reg
● K-means clustering: kmeans
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create
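To give a feel for what a logistic_reg model does with its inputs, here is a minimal Python sketch. It is not BigQuery ML's implementation: the weights, the bias, the restriction to two features, and the 0.5 threshold are all assumptions made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(pageviews: int, is_mobile: bool, weights: dict, bias: float,
            threshold: float = 0.5):
    """Return (probability of an order, predicted 0/1 label)."""
    # Weighted sum of the features plus a bias term.
    z = (bias
         + weights["pageviews"] * pageviews
         + weights["is_mobile"] * (1.0 if is_mobile else 0.0))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical parameters, NOT values learned by any real model.
WEIGHTS = {"pageviews": 0.2, "is_mobile": -0.5}
BIAS = -4.0

print(predict(30, False, WEIGHTS, BIAS))  # many pageviews on desktop
print(predict(3, True, WEIGHTS, BIAS))    # few pageviews on mobile
```

Training is the step that finds the weights; the CREATE MODEL statement does that for you.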

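4. Evaluate your model
#standardSQL
SELECT * FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IF(totals.transactions IS NULL, 0, 1) AS label,
    -- the input must supply the same feature columns used for training
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))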

Data set
http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png
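The queries in this deck follow the holdout idea by date: the model is trained on table suffixes 20160801 to 20170630 and evaluated on 20170701 to 20170801. A minimal Python sketch of that split follows; the function name is hypothetical and the ranges are copied from the queries.

```python
# Date ranges taken from the queries in this deck (inclusive, like SQL BETWEEN).
TRAIN_RANGE = ("20160801", "20170630")
EVAL_RANGE = ("20170701", "20170801")


def split_for_suffix(table_suffix: str) -> str:
    """Return which split a ga_sessions_* table suffix (YYYYMMDD) falls into."""
    # Fixed-width YYYYMMDD strings sort in the same order as the dates.
    if TRAIN_RANGE[0] <= table_suffix <= TRAIN_RANGE[1]:
        return "train"
    if EVAL_RANGE[0] <= table_suffix <= EVAL_RANGE[1]:
        return "eval"
    return "unused"


for suffix in ["20160801", "20170630", "20170701", "20170801", "20160731"]:
    print(suffix, split_for_suffix(suffix))
```

Because the two ranges do not overlap, the model is evaluated on sessions it never saw during training.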

Evaluation results
When the query is complete, click the Results tab below the query text area. The results should look like the following:

Column descriptions
Because you performed a logistic regression, the results include the following columns:
● precision: A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
● recall: A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
● accuracy: Accuracy is the fraction of predictions that a classification model got right.
● f1_score: A measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0.
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.

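Column descriptions: formulas 1
Because you performed a logistic regression, the results include the following columns:
● precision: TP / (TP + FP). Of all the cases predicted positive, the fraction that are actually positive.
● recall: TP / (TP + FN). Of all the cases that are actually positive, the fraction correctly predicted positive. For example, of the visitors who placed an order, the fraction the model predicted would order.
● accuracy: (TP + TN) / (TP + FP + FN + TN). Of all cases, the fraction the model classified correctly.
● f1_score: the harmonic mean of precision and recall.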

Confusion matrix
https://www.ycc.idv.tw/confusion-matrix.html

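Confusion matrix example: HIV/AIDS test prediction
Predicted outcome    True condition: True (has HIV)    True condition: False (no HIV)
Yes                  True Positive (TP)                False Positive (FP)
                     has HIV, test detects HIV         no HIV, test detects HIV
No                   False Negative (FN)               True Negative (TN)
                     has HIV, test does not detect it  no HIV, test detects none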

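Confusion matrix example: HIV/AIDS test prediction
Assume 10,000 people are tested and the model predicts that nobody has HIV.
Predicted outcome    True condition: True (has HIV), 100 people    True condition: False (no HIV), 9,900 people
Yes                  True Positive (TP): 0 people                  False Positive (FP): 0 people
                     has HIV, test detects HIV                     no HIV, test detects HIV
No                   False Negative (FN): 100 people               True Negative (TN): 9,900 people
                     has HIV, test does not detect it              no HIV, test detects none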

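Calculation results
● accuracy: (0 + 9900) / 10000 = 99%
● precision: 0 / (0 + 0), undefined because the model never predicts positive (usually reported as 0) ⇒ the accuracy paradox
● recall: 0 / (0 + 100) = 0
● High accuracy is useless here; what matters is detecting the people who do have HIV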
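The numbers for the screening example above (10,000 people, everyone predicted negative) can be checked with a short Python sketch. The function names are hypothetical, and returning None for an undefined ratio is a choice made here; other tools may report 0 instead.

```python
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the ratio is undefined (zero denominator)."""
    return numerator / denominator if denominator else None


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # harmonic mean is undefined in these cases
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Counts from the example: the model predicts "negative" for everyone.
metrics = classification_metrics(tp=0, fp=0, fn=100, tn=9900)
print(metrics)
```

Accuracy comes out at 0.99 while recall is 0.0, which is the accuracy paradox in numbers: on heavily imbalanced data a model can score well on accuracy without finding a single positive case.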

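The confusion matrix applied to this example: does the user place an order?
Predicted outcome               True condition: True (placed an order)    True condition: False (no order)
Yes (model predicts an order)   True Positive (TP)                        False Positive (FP)
                                orders, model predicts an order           no order, model predicts an order
No (model predicts no order)    False Negative (FN)                       True Negative (TN)
                                orders, model predicts no order           no order, model predicts no order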

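Column descriptions: formulas 1
columns:
● precision: TP / (TP + FP). Of all the cases predicted positive, the fraction that are actually positive.
● recall: TP / (TP + FN). Of all the cases that are actually positive, the fraction correctly predicted positive. For example, of the visitors who placed an order, the fraction the model predicted would order.
● accuracy: (TP + TN) / (TP + FP + FN + TN). Of all cases, the fraction the model classified correctly.
● f1_score: the harmonic mean of precision and recall.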

Column descriptions: formulas 2
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels, so a lower value means the predictions are closer to the actual data.
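As a rough illustration of log loss, the Python sketch below uses the standard binary cross-entropy formula averaged over examples. The function name, the clipping constant, and the sample probabilities are assumptions for illustration, not output from the model in this deck.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted P(label=1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip so that log(0) never occurs for fully confident predictions.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]
good = log_loss(labels, [0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = log_loss(labels, [0.1, 0.9, 0.2, 0.8])   # confident and wrong
print(good, bad)
```

Confident wrong predictions are penalized heavily, so the second value is much larger than the first.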

Column descriptions: formulas 3
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
AUC = 0.5: no discrimination
0.7 ≦ AUC ≦ 0.8: acceptable discrimination
0.8 ≦ AUC ≦ 0.9: excellent discrimination
0.9 ≦ AUC ≦ 1.0: outstanding discrimination
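The probability reading of roc_auc can be computed directly by comparing every positive example with every negative one. This Python sketch does that by brute force; it is fine for small samples, the function name is hypothetical, and counting ties as half is a common convention assumed here.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores: one positive (0.3) is ranked below one negative (0.4).
print(roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1]))
```

A model that ranks every positive above every negative scores 1.0, and one that cannot tell them apart scores 0.5.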

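5. Use the model to predict purchases by country
#standardSQL
SELECT
  country, SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      -- the input must supply the same feature columns used for training
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC LIMIT 10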

6. Predict purchases for each user
#standardSQL
SELECT fullVisitorId, SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      -- the input must supply the same feature columns used for training
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country,
      fullVisitorId
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10
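Both prediction queries end the same way: sum predicted_label per group, sort descending, keep the top 10. The Python sketch below mirrors that aggregation on a few made-up rows; the function name and the rows are hypothetical and are not real ML.PREDICT output.

```python
from collections import Counter


def top_predicted_purchases(rows, key, limit=10):
    """Sum predicted_label per group, then return the largest totals first."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row["predicted_label"]  # like SUM(...) GROUP BY key
    # Like ORDER BY total DESC LIMIT n.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical per-session predictions (1 = predicted to order).
rows = [
    {"country": "United States", "predicted_label": 1},
    {"country": "Taiwan", "predicted_label": 1},
    {"country": "United States", "predicted_label": 1},
    {"country": "Japan", "predicted_label": 0},
    {"country": "Taiwan", "predicted_label": 0},
]
result = top_predicted_purchases(rows, key="country")
print(result)
```

Because predicted_label is 0 or 1 for each session, the sum is simply the number of sessions the model expects to end in an order.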

Conclusion
● If you know SQL, you can use it
● The syntax is simple, so you can try it right away
● Keep the data in the US location

References
BigQuery Start
https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
Machine Learning Crash Course
https://developers.google.com/machine-learning/crash-course/

Thank you
Aaron Lee
aaronlee@mitac.com.tw
