Big Data from a Mobile Advertising Perspective (2015-09-16)


Published in: Data & Analytics

  1. Big Data Consultant & Data Scientist: Craig Chao (趙國仁), chaocraig@gmail.com. “BIG DATA” from Mobile Ads Perspective
  2. Prologue – Myths of Big Data • Big Data, Big Hype?
  3. Prologue – Myths of Big Data • Machine Learning & Statistics have been used in many places; is there nothing new in Big Data?
  4. Prologue – Myths of Big Data • Big Data is Hadoop / Open Source?
  5. “BIG” data?
  6. big “DATA”?
  7. Challenges of Big Data - 4V: Volume (large amounts of data), Variety (diverse data), Velocity (fast data input and processing), Veracity (data truthfulness)
  8. 4V & Solution Directions (Challenge: Direction) • Volume: Scalability • Velocity: Real-time Response • Variety: All data • Veracity: Knowledge engineering & Machine intelligence
  9. The Revolution of Big Data • DATA: Hypotheses, Statistical Analysis; Sampling, Multi-variant…; Human-explainable • BIG DATA: Hypotheses, Machine Learning, Data Mining; All, Hyper space, …; Volume, Velocity, Variety, Veracity; Machine-generated
  10. Reports vs. Models • Reports: Segments, For Human (Explanatory) • Models: Data-driven Actions • Efficiency, Intelligence, Effectiveness • Data Science is the art of turning data into actions.
  11. Data Science Venn Diagram: Stat vs. MLDM? (Big) Programming? Tags?
  12. Common Data Categories • Persona – Age, Gender, Birth date, City, … • Attributes – Phone brand/model, location, time, App, browser, banner… • Behavior – Click, Conversion (Installation, Cart, Purchase, …), Activation, Payment…
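  13. Big data analysis finds more potential customers • Targeting Capability: targeted delivery, targeted exclusion, impression frequency, app detection, delivery context, ad preference, targeted brand fans, product users, interest preferences • Steps 1, 2, 3: collect user behavior data; big data analysis to find potential customers; optimize delivery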
  14. In-database Processing (MPP)
  15. Computation Performance. Source: Matei Zaharia (2013)
  16. Pricing Engine Framework • Layers: Data Injection, Speed Layer, Batch Layer, Serving Layer • Components: Kafka (data streaming), HDFS, Apache Spark, Realtime processors (Spark Streaming), Couchbase, Avro, Akka/Scala Actors, Jenkins, Docker Container
  17. 4R: Reach, Richness, Representation, Range (each rated Low to High) • Reach: user reach (DAU) • Richness: data richness (Behavioral data) • Range: system scope (Affiliate of whole context) • Representation: presentation format and content (Format & Content)
  18. Data Economy: Traditional -> Internet Economy. Axes: REACH (quantity) and RICHNESS (quality), each from Low to High; regions: Traditional Economy, Internet Economy
  19. Reach: The Value Funnel • CPM campaign: Revenue = N/1000 ⋅ CPM • CPC campaign: Revenue = N ⋅ CTR ⋅ CPC • CPA campaign: Revenue = N ⋅ CTR ⋅ CVR ⋅ CPA • UU, Reach (DAU), ARPU = Life-time Value
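The three revenue formulas on this slide can be checked with a short R calculation. This is only a sketch: every number below is a hypothetical value chosen for illustration, and the variable names are not from the deck.

```r
# Hypothetical campaign numbers, chosen only to illustrate the formulas.
n_impressions <- 1e6   # N: impressions served
cpm <- 2.0             # price per 1000 impressions
ctr <- 0.01            # click-through rate (clicks per impression)
cpc <- 0.30            # price per click
cvr <- 0.05            # conversion rate (conversions per click)
cpa <- 8.0             # price per action (conversion)

revenue_cpm <- n_impressions / 1000 * cpm        # Revenue = N/1000 * CPM
revenue_cpc <- n_impressions * ctr * cpc         # Revenue = N * CTR * CPC
revenue_cpa <- n_impressions * ctr * cvr * cpa   # Revenue = N * CTR * CVR * CPA

print(c(CPM = revenue_cpm, CPC = revenue_cpc, CPA = revenue_cpa))
```

Each step down the funnel multiplies in one more rate (CTR, then CVR), so revenue under CPC or CPA pricing depends on how well clicks and conversions can be predicted.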
  20. Richness: Data Quality → Predictive Power
  21. Richness: Predictive Power • Behavioral Data: app category preference, device used, usage time, location area, ad behavior preference • Attribution Data: Conversions, Logs
  22. Richness • Data Quality Richness • Data Utilization Richness – Call taxi (short vs. long route) – Download times vs. Activation days • Data Model Richness
  23. Simplest Model
  24. Logistic Regression
  25. LM & LR. Source: http://www.saedsayad.com/logistic_regression.htm • The benefit of normalization is that values become comparable and have bounds for convergence • Likelihood
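For reference, the logistic regression model and its likelihood can be written out as follows. The symbols are assumptions introduced here, not taken from the slide: $x_i$ is a feature vector, $y_i \in \{0, 1\}$ its label, $w$ the weights and $b$ the bias.

```latex
p_i = P(y_i = 1 \mid x_i) = \sigma(w^\top x_i + b),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

L(w, b) = \prod_{i=1}^{N} p_i^{\,y_i} (1 - p_i)^{1 - y_i},
\qquad
\log L(w, b) = \sum_{i=1}^{N} \bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr]
```

The linear model (LM) predicts $w^\top x + b$ directly; logistic regression (LR) passes the same score through $\sigma$, which bounds the output to $(0, 1)$ so it can be read as a probability.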
  26. User-based / Item-based Recommendation
  27. Matrix = Associations
              Rose  Navy  Olive
      Alice     0    +4     0
      Bob       0     0    +2
      Carol    -1     0    -2
      Dave     +3     0     0
      • Things are associated, like people to colors
      • Associations have strengths, like preferences and dislikes
      • Can quantify associations: Alice loves navy = +4, Carol dislikes olive = -2
      • We don’t know all associations: many implicit zeroes
      Source: Sean Owen (2012), Cloudera
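As a small illustration, the association matrix on this slide can be held in an R matrix with named rows and columns. The variable name `ratings` is an assumption made for this sketch.

```r
# The people-by-colors association matrix from the slide; zeros are unknown
# (implicit) associations rather than measured neutral preferences.
ratings <- matrix(
  c( 0, 4,  0,
     0, 0,  2,
    -1, 0, -2,
     3, 0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Alice", "Bob", "Carol", "Dave"),
                  c("Rose", "Navy", "Olive"))
)

print(ratings["Alice", "Navy"])    # strength of "Alice loves navy"
print(ratings["Carol", "Olive"])   # strength of "Carol dislikes olive"
print(sum(ratings == 0))           # number of implicit zeroes
```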
  28. In Terms of Few Features • Can explain associations by appealing to underlying features in common (e.g. “blue-ness”) • Relatively few (one “blue-ness”, but many shades) • (Alice) (Blue) (Navy) • Source: Sean Owen (2012), Cloudera
  29. Losing Information is Helpful • When k (= features) is small, information is lost • Factorization is approximate (Alice appears to like blue-ish periwinkle too) • (Alice) (Blue) (Navy) (Periwinkle) • Source: Sean Owen (2012), Cloudera
  30. Singular Value Decomposition: A (m × n) = S (m × k) • Σ (k × k) • T’ (k × n)
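A minimal sketch of a truncated SVD in base R, applied to the people-by-colors matrix from the "Matrix = Associations" slide. The choice `k <- 2` and the variable names are assumptions for illustration; in the slide's notation, `s$u` plays the role of S, `diag(s$d)` of Σ, and `t(s$v)` of T’.

```r
# People-by-colors matrix from the "Matrix = Associations" slide.
A <- matrix(c( 0, 4,  0,
               0, 0,  2,
              -1, 0, -2,
               3, 0,  0), nrow = 4, byrow = TRUE)

s <- svd(A)   # A = U diag(d) V'

# Using all singular values reproduces A up to rounding error.
A_full <- s$u %*% diag(s$d) %*% t(s$v)

# Keeping only the k largest singular values gives an approximation:
# information is lost, which is what lets the model generalize.
k <- 2
A_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# Frobenius-norm error of the rank-k approximation.
err <- sqrt(sum((A - A_k)^2))
print(round(A_k, 2))
print(err)
```

With a small `k`, entries that were zero in `A` generally become non-zero in `A_k`; those filled-in values are what a recommender would read as predicted associations.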
  31. Context-aware Matrix Factorization
  32. Sample FM Matrix
  33. Optimization Perspective
  34. Gradient Descent
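The idea can be sketched with batch gradient descent on the logistic regression loss. The toy data, learning rate `eta`, and iteration count are all hypothetical values picked for this example, not settings from the deck.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical toy data: one feature, labels in {0, 1}, not perfectly separable.
x <- c(-2, -1, -0.5, 0.5, 1, 2, -1.5, 1.5)
y <- c( 0,  0,  1,   0,   1, 1,  0,   1)
X <- cbind(1, x)   # first column is the bias term

# Negative log-likelihood of logistic regression (the loss to minimize).
neg_log_lik <- function(w) {
  p <- logistic(X %*% w)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

w <- c(0, 0)   # start from zero weights
eta <- 0.1     # learning rate
loss_start <- neg_log_lik(w)
for (iter in 1:500) {
  grad <- t(X) %*% (logistic(X %*% w) - y)   # gradient of the loss
  w <- w - eta * as.vector(grad)             # step against the gradient
}
loss_end <- neg_log_lik(w)

print(c(loss_start = loss_start, loss_end = loss_end))
print(w)
```

Stochastic gradient descent (SGD) follows the same update rule but computes the gradient from one training example at a time instead of the whole data set.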
  35. FM with SGD. Source: Rendle, S. (2012)
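For reference, the second-order factorization machine model and an SGD update for logistic loss can be written as below. This is a sketch in standard FM notation, which is assumed here rather than copied from the slide: $x \in \mathbb{R}^n$ is the feature vector, $y \in \{-1, +1\}$ the label, $v_i \in \mathbb{R}^k$ the latent factors of feature $i$, $\eta$ the learning rate and $\lambda$ the regularization weight.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k}
    \left[ \Bigl( \sum_{i=1}^{n} v_{i,f} x_i \Bigr)^{2}
         - \sum_{i=1}^{n} v_{i,f}^{2} x_i^{2} \right]

\theta \leftarrow \theta - \eta
  \left[ \bigl( \sigma(y \, \hat{y}(x)) - 1 \bigr) \, y \,
         \frac{\partial \hat{y}(x)}{\partial \theta} + 2 \lambda \theta \right],
\qquad
\frac{\partial \hat{y}(x)}{\partial v_{i,f}} = x_i \sum_{j \neq i} v_{j,f} x_j
```

The second line is the identity that reduces the pairwise term from $O(k n^2)$ to $O(k n)$; the R code on the following slides computes its decision value in this form.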
  36. Factorization Machine - R (un-optimized version)
      #
      # Factorization machine
      #
      logis <- function(x) {
        result <- 1 / (1 + exp(-x))
        return(result)
      }
      wTx <- function(x, w, V) {  # decision value
        V.size <- dim(V)
        p <- V.size[1]  # rows (features)
        k <- V.size[2]  # columns (latent factors)
        tmp <- 0
        for (i in 1:k) {
          tmp1 <- 0
          tmp2 <- 0
          for (j in 1:p) {
            tmp1 <- tmp1 + V[j, i] * x[j]
            tmp2 <- tmp2 + (V[j, i] * x[j])^2
          }
          tmp <- tmp + (tmp1^2 - tmp2)
        }
        tmp <- 0.5 * tmp
        x[length(x) + 1] <- 1
        result <- x %*% t(w) + tmp  # x is all features + bias
        return(result)
      }
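Since the slide labels its code as un-optimized, here is one possible vectorized form of the same decision value, replacing the two nested loops with matrix products. The function name `fm_decision` and the example inputs are hypothetical; the argument layout (bias weight stored last in `w`) is assumed to match the slide's `wTx(x, w, V)`.

```r
# Vectorized FM decision value.
#   x: feature vector of length p
#   w: 1 x (p + 1) weight matrix, bias weight stored last
#   V: p x k matrix of latent factors
fm_decision <- function(x, w, V) {
  xv   <- as.vector(crossprod(V, x))        # per factor: sum_j V[j, f] * x[j]
  x2v2 <- as.vector(crossprod(V^2, x^2))    # per factor: sum_j V[j, f]^2 * x[j]^2
  pairwise <- 0.5 * sum(xv^2 - x2v2)        # all pairwise interactions
  linear <- sum(c(x, 1) * as.vector(w))     # linear part plus bias
  linear + pairwise
}

# Hypothetical example inputs: 4 features, 2 latent factors.
x <- c(1, 0, 2, -1)
w <- matrix(c(0.5, -0.2, 0.1, 0.3, 0.7), nrow = 1)
V <- matrix((1:8) / 10, nrow = 4, ncol = 2)
print(fm_decision(x, w, V))
```

Unlike `wTx`, which returns a 1 × 1 matrix, this sketch returns a plain numeric value.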
  37. Factorization Machine - R
      FMlogistic_ <- function(A, y, At, yt, k, lambda, eta, numiter) {
        #
        # A: input matrix
        # y: label
        # At: test set of A
        # yt: test set of y
        # k: number of latent factors
        # lambda: regularization parameter
        # eta: learning rate
        # numiter: number of iterations
        #

        A.size <- dim(A)  # [numinst, numfeat]
        numinst <- A.size[1]
        numfeat <- A.size[2]
        nt <- numinst

        #B <- matrix(1, numinst, numfeat)
        #B.size <- dim(B)
        #Bt <- matrix(1, B.size[1], B.size[2])
        sigma <- 0.1  # standard deviation
        # Start here… Model parameters theta = (w0, w, V)
        w0 <- matrix(0, 1, numfeat + 1)  # weights of features, +1 is for bias
        w <- matrix(0, 1, numfeat + 1)   # weights of features, +1 is for bias
        #V0 <- matrix(c(rnorm(numfeat*k, mean = 0, sd = sigma)), numfeat, k)  # generates a numfeat-by-k matrix
        V0 <- matrix(0.1, numfeat, k)
        V <- matrix(0, numfeat, k)  # output matrix
  38. Factorization Machine - R
        for (iter in 1:numiter) {
          for (i in 1:numinst) {
            for (j in 1:numfeat) {
              # SGD step for the linear weight of feature j
              w[j] <- w0[j] - eta*((logis(wTx(A[i,], w0, V0)*y[i])-1)*y[i]*A[i,j] + 2*lambda*w0[j])
              for (numlatent in 1:k) {
                ind <- setdiff(1:numfeat, j)
                # gradient of the decision value with respect to V[j, numlatent]
                hx <- A[i,j] * sum(V0[ind,numlatent] * A[i,ind])
                V[j,numlatent] <- V0[j,numlatent] - eta*((logis(wTx(A[i,], w0, V0)*y[i])-1)*y[i]*hx + 2*lambda*V0[j,numlatent])
              }
            }
            # SGD step for the bias, stored as the last element of w
            w[length(w)] <- w0[length(w0)] - eta*((logis(wTx(A[i,], w0, V0)*y[i])-1)*y[i] + 2*lambda*w0[length(w0)])
            V0 <- V
            w0 <- w
          }
          # nt equals the number of rows of A, so At is assumed to have as many rows as A
          yhat <- matrix(0, nt, 1)
          for (i in 1:nt) {
            yhat[i] <- wTx(At[i,], w, V)
          }
          prob <- 1 / (1 + exp(-yhat))
          yhat[yhat >= 0] <- 1
          yhat[yhat < 0] <- -1
          acc <- sum(yt == yhat) / nt
          cat(sprintf('\n#iter = %d, training accuracy = %f\n', iter, acc))

        }
        return(list(prob, yhat))
      }
  39. Richness • Data Model Richness
  40. Representation
  41. Representation
  42. Representation: TV campaign, Mobile Campaign, Offline Campaign; Range, Reach, Richness; Cross-screen Effect
  43. Success story: mastering the 4Rs delivers better results! • Cross-screen synergy • Big data synergy with Cross-screen effect • +TV
  44. Range. A business that values only quantitative models will be unable to find potential growth opportunities: "The biggest problem with the pursuit of quantification is that it ignores the context in which people's behavior arises, detaches events from their setting, and overlooks the effect of variables that were not included in the model." - Roger Martin, Rotman School of Management, Toronto
  45. Range
  46. Range • Google trend, Viral install…
  47. 4R: Reach, Richness, Representation, Range (each rated Low to High) • Reach: user reach (DAU) • Richness: data richness (Behavioral data) • Range: system scope (Affiliate of whole context) • Representation: presentation format and content (Format & Content)
  48. Big Data - Google Now
  49. Duolingo
  50. Facebook Personal Assistant
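  51. • The world's most advanced tracker: activity tracking, sleep tracking, Smart Coach, and heart health records • iPaaS helps companies connect enterprise applications in the cloud and on-premises • Cancer analysis visualization
  52. • The thermostat and smart-home company founded by Tony Fadell, "father of the iPod" • Integration and analysis of medical data • Open platform for government spending • More fuel-efficient and safer driving • Online financial advisory and management platform serving technology professionals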
  53. World, Model & Theory. Credit: John F. Sowa
  54. BIG DATA: Data is always for the sake of humanity. Use data; don't be used by it.
  55. Summary - Innovation
  56. Thank you, everyone! chaocraig@gmail.com
  57. Data Scientist as CEO of Data. Source: 經理人 (117)
