Create a Machine Learning Model with SQL Syntax: BigQuery ML Introduction
With BigQuery ML, you can build and run a machine learning model, and make predictions with it, using only SQL syntax. SQL users can speed up development with the tools they already have: there is no need to move data or to spend time building a complicated TensorFlow pipeline. This makes machine learning accessible to more people.
8. Always free usage limits
| Resource | Monthly free usage limits | Details |
| --- | --- | --- |
| Storage | The first 10 GB per month is free. | BigQuery ML models and training data stored in BigQuery are included in the storage free tier. |
| Queries (analysis) | The first 1 TB of query data processed per month is free. | Queries that use BigQuery ML prediction, inspection, and evaluation functions are included in the analysis free tier. BigQuery ML queries that contain CREATE MODEL statements are not. Flat-rate pricing is also available for high-volume customers that prefer a stable, monthly cost. |
| BigQuery ML CREATE MODEL queries | The first 10 GB of data processed by queries that contain CREATE MODEL statements per month is free. | BigQuery ML CREATE MODEL queries are independent of the BigQuery analysis free tier. |
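The three allowances are tracked separately, so usage in one does not draw down another. A minimal sketch of that arithmetic follows. The function and constant names are hypothetical (this is not a BigQuery API), and the conversion 1 TB = 1024 GB is an assumption made for illustration.

```python
# Hypothetical helper for illustration only; not part of any BigQuery API.
FREE_STORAGE_GB = 10        # storage free tier per month
FREE_ANALYSIS_GB = 1024     # 1 TB of analysis queries (assumes 1 TB = 1024 GB)
FREE_CREATE_MODEL_GB = 10   # data processed by CREATE MODEL queries per month


def billable_gb(used_gb, free_gb):
    """Return the usage above the free allowance, never negative."""
    return max(0.0, used_gb - free_gb)


def monthly_billable(storage_gb, analysis_gb, create_model_gb):
    """Apply each free allowance to its own resource independently."""
    return {
        "storage": billable_gb(storage_gb, FREE_STORAGE_GB),
        "analysis": billable_gb(analysis_gb, FREE_ANALYSIS_GB),
        "create_model": billable_gb(create_model_gb, FREE_CREATE_MODEL_GB),
    }


# Made-up usage: 12 GB stored, 500 GB analysed, 25 GB processed by CREATE MODEL.
usage = monthly_billable(storage_gb=12, analysis_gb=500, create_model_gb=25)
print(usage)
```

In this made-up month the unused analysis allowance does not offset the CREATE MODEL overage, which reflects the "independent of the BigQuery analysis free tier" note in the table.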
13. Choose United States as the data location
On the Create dataset page:
● For Dataset ID, enter bqml_tutorial.
● For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.
● Leave all of the other default settings in place and click Create dataset.
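14. Step 2: Create the model
#standardSQL
-- Train a logistic regression model that predicts whether a session includes a purchase.
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  -- label is 1 when totals.transactions is not NULL, otherwise 0
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  -- Training window: 1 August 2016 through 30 June 2017
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'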
18. Step 4: Evaluate your model
#standardSQL
-- Evaluate the model on sessions that fall outside the training window.
SELECT * FROM
  ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      -- Evaluation window: 1 July 2017 through 1 August 2017
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
20. Evaluation results
When the query is complete, click the Results tab below the query text area. The results should look like the following:
21. Column descriptions
Because you performed a logistic regression, the results include the following columns:
● precision: A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
● recall: A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
● accuracy: The fraction of predictions that a classification model got right.
● f1_score: A measure of the accuracy of the model. The f1 score is the harmonic mean of the precision and recall. An f1 score's best value is 1. The worst value is 0.
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
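22. Column descriptions: formulas (1)
Because you performed a logistic regression, the results include the following columns. Here TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives.
● precision: TP / (TP + FP). Of the individuals predicted as positive, the proportion that are actually positive.
● recall: TP / (TP + FN). Of the individuals that are actually positive, the proportion correctly predicted as positive. For example, of the people who placed an order, the proportion correctly predicted to order.
● accuracy: (TP + TN) / (TP + TN + FP + FN). Of all individuals, the proportion classified correctly. Note that TN / (TN + FP), the proportion of actual negatives correctly predicted as negative, is specificity rather than accuracy.
● f1_score: The harmonic mean of precision and recall, 2 × precision × recall / (precision + recall).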
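As a rough illustration of how these columns relate to the four confusion-matrix counts, here is a minimal Python sketch. The function name and the example counts are made up for illustration and do not come from the tutorial's results. Accuracy is computed as the fraction of all predictions that are correct, matching the description of the accuracy column.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1,
    }


# Made-up counts: 1000 sessions, 60 of which actually ended in a purchase.
metrics = classification_metrics(tp=40, fp=10, tn=930, fn=20)
for name, value in metrics.items():
    print(name, round(value, 3))
# precision 0.8, recall 0.667, accuracy 0.97, f1_score 0.727
```

With labels this imbalanced, accuracy stays high even though a third of the real purchases are missed, which is one reason to read precision and recall alongside it.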
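28. Column descriptions: formulas (1)
Because you performed a logistic regression, the results include the following columns. Here TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives.
● precision: TP / (TP + FP). Of the individuals predicted as positive, the proportion that are actually positive.
● recall: TP / (TP + FN). Of the individuals that are actually positive, the proportion correctly predicted as positive. For example, of the people who placed an order, the proportion correctly predicted to order.
● accuracy: (TP + TN) / (TP + TN + FP + FN). Of all individuals, the proportion classified correctly. Note that TN / (TN + FP), the proportion of actual negatives correctly predicted as negative, is specificity rather than accuracy.
● f1_score: The harmonic mean of precision and recall, 2 × precision × recall / (precision + recall).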
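29. Column descriptions: formulas (2)
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels. In other words, it reflects how close the predicted results are to the actual data; lower values are better.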
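To make "how far the predictions are from the correct labels" concrete, the sketch below computes the usual binary log loss: the average of -(y·log p + (1 - y)·log(1 - p)) over all examples. It is a minimal illustration under the assumption that this standard definition is what the log_loss column reports; the function name, the clipping constant, and the sample values are made up.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]
confident_right = log_loss(labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = log_loss(labels, [0.1, 0.9, 0.2, 0.8])
print(confident_right < confident_wrong)  # True
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so a value well below that suggests the predicted probabilities carry real information.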
30. Column descriptions: formulas (3)
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
A common rule of thumb for interpreting AUC:
AUC = 0.5 (no discrimination)
0.7 ≤ AUC < 0.8 (acceptable discrimination)
0.8 ≤ AUC < 0.9 (excellent discrimination)
0.9 ≤ AUC ≤ 1.0 (outstanding discrimination)
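The probabilistic reading of roc_auc can be checked directly by comparing every positive example with every negative one and counting how often the positive example gets the higher score. The sketch below does exactly that. It is an illustrative brute-force calculation with made-up scores, not how BigQuery ML computes the column, and counting ties as half a win is a common convention assumed here.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half (assumed convention)
    return wins / (len(pos) * len(neg))


# Made-up scores: one positive (0.4) is ranked below one negative (0.5).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75, which falls in the "acceptable" band of the rule of thumb above.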
31. Step 5: Use the model to predict purchases by country
#standardSQL
-- Sum the predicted labels per country and keep the ten countries with the most predicted purchases.
SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10
33. Step 6: Predict purchases per user
#standardSQL
-- Same prediction query, grouped by visitor instead of by country.
SELECT
  fullVisitorId,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country,
      fullVisitorId
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10
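Both prediction queries follow the same pattern: each session row gets a predicted_label of 0 or 1, the labels are summed per group, and the ten largest totals are kept. The Python sketch below mirrors that aggregation on a handful of made-up rows. The function name and sample data are hypothetical; only the column names are taken from the queries.

```python
from collections import Counter


def top_predicted_purchases(rows, key, limit=10):
    """Sum predicted_label per group and return the largest totals first."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row["predicted_label"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up rows standing in for ML.PREDICT output (one row per session).
rows = [
    {"country": "United States", "fullVisitorId": "a1", "predicted_label": 1},
    {"country": "Taiwan", "fullVisitorId": "b2", "predicted_label": 0},
    {"country": "United States", "fullVisitorId": "a1", "predicted_label": 1},
    {"country": "Taiwan", "fullVisitorId": "c3", "predicted_label": 1},
]

print(top_predicted_purchases(rows, key="country"))
# [('United States', 2), ('Taiwan', 1)]
print(top_predicted_purchases(rows, key="fullVisitorId", limit=2))
```

Switching the key from country to fullVisitorId is the only difference between the two queries, which is why the per-user version also has to select fullVisitorId in the inner query.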