20190424 Can You Do Machine Learning with Just SQL? An Introduction to BigQuery ML

Create a machine learning model with SQL syntax? An introduction to BigQuery ML.
With BigQuery ML you can create and run a machine learning model, and make predictions with it, using only SQL. SQL users can speed up development with the tools they already have: there is no need to move data or to spend time building a TensorFlow model, which makes machine learning accessible to more people.

Published in: Data & Analytics

  1. Proprietary + Confidential. Can You Do Machine Learning with Just SQL? An Introduction to BigQuery ML. Aaron Lee, aaronlee@mitac.com.tw
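  2. 李東霖 (Aaron Lee). Current positions ● Google Solutions Consultant, MiTAC Information Technology (神通資訊科技) ● Product Manager for Qlik and Sophos. Experience ● Google Apps certification ● Google Cloud Platform architect ● PMP (Project Management Professional) ● SAP MM consultant ● Oracle OCP certification. Speaking and teaching experience: Project Management Association (專案管理師協會), Providence University (靜宜大學), 前川科技, 毅太科技, Water Resources Agency (水利署), National Defense University (國防大學), E.SUN Bank (玉山銀行), MiTAC Information Technology, Toastmasters International (國際演講協會), Taoyuan City tax bureau (桃園巿稅務局), TAITRA (外貿協會), 亞東氣體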
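  3. Why BigQuery ML? With SQL alone you can create and run an ML model and make predictions. SQL users can speed up development with their existing tools, without moving data and without spending time building a TensorFlow model, which makes machine learning accessible to more people.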
  4. BigQuery ML is now GA!
  5. The result...
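  6. Objectives ● Build a model from sample data that predicts whether an e-commerce visitor will place an order ● Use the CREATE MODEL statement to build a binary logistic regression model (yes/no) ● Use the ML.EVALUATE function to evaluate the ML model ● Use the ML.PREDICT function to make predictions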
  7. Always free usage limits (resource: monthly free usage limit, then details)
      ● Storage: The first 10 GB per month is free. BigQuery ML models and training data stored in BigQuery are included in the storage free tier.
      ● Queries (analysis): The first 1 TB of query data processed per month is free. Queries that use BigQuery ML prediction, inspection, and evaluation functions are included in the analysis free tier. BigQuery ML queries that contain CREATE MODEL statements are not. Flat-rate pricing is also available for high-volume customers that prefer a stable, monthly cost.
      ● BigQuery ML CREATE MODEL queries: The first 10 GB of data processed by queries that contain CREATE MODEL statements per month is free. BigQuery ML CREATE MODEL queries are independent of the BigQuery analysis free tier.
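The three free allowances on this slide are tracked separately, so usage in one bucket does not consume another. A minimal sketch of that arithmetic follows; the function name, the dictionary keys, and the choice of 1 TB = 1024 GB are assumptions made for illustration, not part of the BigQuery pricing documentation.

```python
# Free monthly allowances quoted on the slide, in GB.
# Assumption: 1 TB is treated as 1024 GB here.
FREE_GB = {"storage": 10, "analysis": 1024, "create_model": 10}


def billable_gb(usage_gb):
    """Return the usage above each free allowance (hypothetical helper).

    Each bucket is independent: CREATE MODEL queries do not draw on the
    analysis free tier, and vice versa.
    """
    return {
        bucket: max(0, usage_gb.get(bucket, 0) - free)
        for bucket, free in FREE_GB.items()
    }


# Made-up monthly usage, purely illustrative.
example = billable_gb({"storage": 12, "analysis": 500, "create_model": 25})
print(example)  # {'storage': 2, 'analysis': 0, 'create_model': 15}
```

In this made-up month only 2 GB of storage and 15 GB of CREATE MODEL processing would fall outside the free tier; the 500 GB of analysis queries stays within it.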
  8. US pricing, however...
  9. Taiwan pricing
  10. Raw data: e-commerce visitors and whether they placed an order
  11. Step 1: Create the dataset "bqml_tutorial" (using the new UI)
  12. Choose United States as the location. On the Create dataset page: ● For Dataset ID, enter bqml_tutorial. ● For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location. ● Leave all of the other default settings in place and click Create dataset.
  13. Step 2: Create the model
      #standardSQL
      -- Train a logistic regression model; label is 1 when the session had a transaction.
      CREATE MODEL `bqml_tutorial.sample_model`
      OPTIONS(model_type='logistic_reg') AS
      SELECT
        IF(totals.transactions IS NULL, 0, 1) AS label,
        IFNULL(device.operatingSystem, "") AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, "") AS country,
        IFNULL(totals.pageviews, 0) AS pageviews
      FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      -- Training window: 1 Aug 2016 through 30 Jun 2017
      WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
  14. This takes a long time...
  15. Model types available in BigQuery ML ● Linear regression: linear_reg ● Binary logistic regression: logistic_reg ● Multiclass logistic regression: logistic_reg ● K-means clustering: kmeans https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create
  16. Step 3: Get the training results
  17. Step 4: Evaluate your model
      #standardSQL
      -- Evaluate on sessions the model did not see during training.
      SELECT *
      FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
        SELECT
          IF(totals.transactions IS NULL, 0, 1) AS label,
          IFNULL(device.operatingSystem, "") AS os,
          device.isMobile AS is_mobile,
          IFNULL(geoNetwork.country, "") AS country,
          IFNULL(totals.pageviews, 0) AS pageviews
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        -- Evaluation window: 1 Jul 2017 through 1 Aug 2017
        WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
  18. Data set (hold-out split figure): http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png
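The hold-out idea in this figure is what the two `_TABLE_SUFFIX` ranges in the CREATE MODEL and ML.EVALUATE queries implement: one date window for training and a later, non-overlapping one for evaluation. The sketch below mirrors that split in plain Python; `assign_split` is a hypothetical helper written for illustration, not a BigQuery function.

```python
# Date windows copied from the deck's CREATE MODEL and ML.EVALUATE queries.
TRAIN_RANGE = ("20160801", "20170630")
EVAL_RANGE = ("20170701", "20170801")


def assign_split(table_suffix):
    """Classify a YYYYMMDD table suffix as 'train', 'eval', or 'unused'.

    Fixed-width YYYYMMDD strings sort in date order, so plain string
    comparison behaves like the inclusive BETWEEN in the SQL queries.
    """
    if TRAIN_RANGE[0] <= table_suffix <= TRAIN_RANGE[1]:
        return "train"
    if EVAL_RANGE[0] <= table_suffix <= EVAL_RANGE[1]:
        return "eval"
    return "unused"


for suffix in ("20160801", "20170630", "20170701", "20170801", "20170802"):
    print(suffix, assign_split(suffix))
```

Because the two windows do not overlap, no session is used for both training and evaluation.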
  19. Evaluation results. When the query is complete, click the Results tab below the query text area. The results should look like the following:
  20. Column descriptions. Because you performed a logistic regression, the results include the following columns: ● precision: A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class. ● recall: A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify? ● accuracy: Accuracy is the fraction of predictions that a classification model got right. ● f1_score: A measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0. ● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels. ● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
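  21. Column descriptions: formulas (1). Because you performed a logistic regression, the results include the following columns: ● precision: TP / (TP + FP), the fraction of individuals predicted positive that are actually positive ● recall: TP / (TP + FN), the fraction of actual positives that are correctly predicted positive; for example, of the visitors who placed an order, the share the model correctly predicted would order ● accuracy: (TP + TN) / (TP + TN + FP + FN), the fraction of all individuals that are classified correctly. (TN / (TN + FP), the fraction of actual negatives correctly predicted negative, is specificity, a different metric.) ● f1_score: the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall)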
  22. Confusion matrix https://www.ycc.idv.tw/confusion-matrix.html
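  23. Confusion matrix example: AIDS test prediction. Rows are the predicted outcome; columns are the true condition.
      ● Predicted Yes, true condition True (has AIDS): has AIDS and tested positive, True Positive (TP)
      ● Predicted Yes, true condition False (no AIDS): no AIDS but tested positive, False Positive (FP)
      ● Predicted No, true condition True (has AIDS): has AIDS but tested negative, False Negative (FN)
      ● Predicted No, true condition False (no AIDS): no AIDS and tested negative, True Negative (TN)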
  24. Confusion matrix example: AIDS test prediction. Assume 10,000 people are tested and the model predicts that nobody has AIDS. True condition: 100 people have AIDS, 9,900 do not.
      ● Predicted Yes, has AIDS: True Positive (TP) = 0
      ● Predicted Yes, no AIDS: False Positive (FP) = 0
      ● Predicted No, has AIDS: False Negative (FN) = 100
      ● Predicted No, no AIDS: True Negative (TN) = 9,900
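  25. Calculation results ● accuracy: (0 + 9900) / (9900 + 100) = 99%, even though the model detects nobody ⇒ the accuracy paradox ● precision: 0 / (0 + 0), undefined because the model never predicts positive (often reported as 0) ● recall: 0 / (0 + 100) = 0 ● High accuracy is useless here; what matters is detecting the people who actually have AIDS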
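The numbers for the all-negative model (TP = 0, FP = 0, FN = 100, TN = 9,900) can be checked with a few lines of Python. This is a minimal sketch using the standard textbook definitions; `classification_metrics` is a hypothetical helper, and returning `None` for an undefined ratio is a choice made here (libraries often report 0 instead).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    A ratio with a zero denominator is returned as None (assumption of this
    sketch) rather than being silently reported as 0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}


# Counts from the slide: 10,000 people tested, model says everyone is negative.
m = classification_metrics(tp=0, fp=0, fn=100, tn=9900)
print(m)
```

Accuracy comes out at 0.99 while recall is 0.0 and precision is undefined, which is the accuracy paradox the slide describes: on a heavily imbalanced data set, accuracy alone says little about whether the positive class is detected.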
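  26. The confusion matrix applied to this example: does the user place an order? Rows are the model's prediction; columns are the true condition.
      ● Predicted Yes (model predicts an order), true condition True (really ordered): True Positive (TP)
      ● Predicted Yes (model predicts an order), true condition False (did not order): False Positive (FP)
      ● Predicted No (model predicts no order), true condition True (really ordered): False Negative (FN)
      ● Predicted No (model predicts no order), true condition False (did not order): True Negative (TN)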
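To make the four cells concrete, the sketch below tallies them from a list of true labels and a list of predicted labels, where 1 means "placed an order" and 0 means "did not". The helper name and the six sample sessions are invented for illustration; they are not taken from the Google Analytics sample data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for binary labels (1 = ordered, 0 = did not)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1   # ordered, and the model predicted an order
        elif p == 1 and a == 0:
            counts["FP"] += 1   # did not order, but the model predicted an order
        elif p == 0 and a == 1:
            counts["FN"] += 1   # ordered, but the model predicted no order
        else:
            counts["TN"] += 1   # did not order, and the model predicted no order
    return counts


# Hypothetical sessions, for illustration only.
actual_labels = [1, 0, 0, 1, 0, 0]
predicted_labels = [1, 0, 1, 0, 0, 0]
cells = confusion_counts(actual_labels, predicted_labels)
print(cells)  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 3}
```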
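  27. Column descriptions: formulas (1). Because you performed a logistic regression, the results include the following columns: ● precision: TP / (TP + FP), the fraction of individuals predicted positive that are actually positive ● recall: TP / (TP + FN), the fraction of actual positives that are correctly predicted positive; for example, of the visitors who placed an order, the share the model correctly predicted would order ● accuracy: (TP + TN) / (TP + TN + FP + FN), the fraction of all individuals that are classified correctly. (TN / (TN + FP), the fraction of actual negatives correctly predicted negative, is specificity, a different metric.) ● f1_score: the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall)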
  28. Column descriptions: formulas (2) ● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels; in other words, how closely the predicted results match the actual data. Lower is better.
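For binary labels, log loss is usually defined as the average of -[y·log(p) + (1 - y)·log(1 - p)], where y is the true label and p is the predicted probability of the positive class. The sketch below implements that standard definition; the function name and the clipping constant `eps` are choices made here, and BigQuery ML's internal computation may differ in detail.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted probabilities.

    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Confident, correct predictions give a lower loss than hesitant ones.
confident = log_loss([1, 0], [0.9, 0.1])
hesitant = log_loss([1, 0], [0.6, 0.4])
coin_flip = log_loss([1, 0], [0.5, 0.5])  # equals ln 2
print(confident, hesitant, coin_flip)
```

A model that always answers 0.5 scores ln 2 (about 0.693), which is a convenient baseline when reading the log_loss column.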
  29. Column descriptions: formulas (3) ● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course. Rule of thumb: ● AUC = 0.5: no discrimination ● 0.7 ≤ AUC ≤ 0.8: acceptable discrimination ● 0.8 ≤ AUC ≤ 0.9: excellent discrimination ● 0.9 ≤ AUC ≤ 1.0: outstanding discrimination
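The probabilistic reading of roc_auc can be computed directly: compare every positive example with every negative example and count how often the positive one receives the higher score. The sketch below does exactly that; it is a quadratic-time illustration with a hypothetical function name and made-up scores, not how BigQuery ML computes the metric.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Raises ValueError if either class is missing,
    because the metric is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```

A score of 0.75 would fall in the "acceptable discrimination" band of the rule of thumb above, while a classifier that gives every example the same score lands at 0.5.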
  30. Step 5: Use the model to predict results by country
      #standardSQL
      -- Sum the predicted purchases per country for sessions from 1 Jul to 1 Aug 2017.
      SELECT
        country,
        SUM(predicted_label) AS total_predicted_purchases
      FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
        SELECT
          IFNULL(device.operatingSystem, "") AS os,
          device.isMobile AS is_mobile,
          IFNULL(totals.pageviews, 0) AS pageviews,
          IFNULL(geoNetwork.country, "") AS country
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
      GROUP BY country
      ORDER BY total_predicted_purchases DESC
      LIMIT 10
  31. Query results
  32. Step 6: Predict purchases per user
      #standardSQL
      -- Sum the predicted purchases per visitor; fullVisitorId is passed through to the output.
      SELECT
        fullVisitorId,
        SUM(predicted_label) AS total_predicted_purchases
      FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
        SELECT
          IFNULL(device.operatingSystem, "") AS os,
          device.isMobile AS is_mobile,
          IFNULL(totals.pageviews, 0) AS pageviews,
          IFNULL(geoNetwork.country, "") AS country,
          fullVisitorId
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
      GROUP BY fullVisitorId
      ORDER BY total_predicted_purchases DESC
      LIMIT 10
  33. Prediction results
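  34. Conclusion ● If you know SQL syntax, you can use it ● The syntax is simple, so you can start building right away ● The data is stored in the US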
  35. References ● BigQuery Start: https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start ● Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course/
  36. Thank you. Aaron Lee, aaronlee@mitac.com.tw
