9. The Revolution of Big Data
• DATA: hypotheses, statistical analysis; human-explainable; sampling, multi-variant…
• BIG DATA: hypotheses, machine learning, data mining; machine-generated; all data, hyper space, …; Volume, Velocity, Variety, Veracity
10. Reports vs. models
• Reports: segments, for humans (explanatory)
• Models: data-driven, for actions
• Efficiency, Intelligence, Effectiveness
Data Science is the art of turning data into actions.
19. 4R: Reach, Richness, Representation, Range
• Reach: user reach (DAU), Low to High
• Richness: data richness (behavioral data), Low to High
• Range: system scope (affiliate of whole context), High
• Representation: presentation form and content (format & content)
20. Data Economy
Traditional -> Internet Economy
[Figure: Reach (quantity) against Richness (quality), each axis running from Low to High, showing the Traditional Economy and the Internet Economy.]
21. Reach: The Value Funnel
• CPM campaign: Revenue = N / 1000 ⋅ CPM
• CPC campaign: Revenue = N ⋅ CTR ⋅ CPC
• CPA campaign: Revenue = N ⋅ CTR ⋅ CVR ⋅ CPA
UU Reach (DAU)
ARPU = Life-time Value
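The three campaign formulas can be sketched in a few lines of Python. This is only an illustration: it assumes N is the number of impressions, CTR the click-through rate, and CVR the conversion rate, and every sample figure below is hypothetical.

```python
def cpm_revenue(impressions: float, cpm: float) -> float:
    """Revenue when paid per thousand impressions: N / 1000 * CPM."""
    return impressions / 1000 * cpm


def cpc_revenue(impressions: float, ctr: float, cpc: float) -> float:
    """Revenue when paid per click: N * CTR * CPC."""
    return impressions * ctr * cpc


def cpa_revenue(impressions: float, ctr: float, cvr: float, cpa: float) -> float:
    """Revenue when paid per action (conversion): N * CTR * CVR * CPA."""
    return impressions * ctr * cvr * cpa


# Hypothetical campaign: one million impressions (assumed meaning of N).
n = 1_000_000
print(cpm_revenue(n, cpm=2.0))                       # paid for every impression
print(cpc_revenue(n, ctr=0.01, cpc=0.25))            # paid only for clicks
print(cpa_revenue(n, ctr=0.01, cvr=0.05, cpa=6.0))   # paid only for conversions
```

Each step down the funnel multiplies by one more rate, so the same reach N earns revenue from a smaller and smaller share of users.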
24. Richness
• Data Quality Richness
• Data Utilization Richness
– Call taxi (short vs. long route)
– Download times vs. Activation days
• Data Model Richness
29. Matrix = Associations
        Rose  Navy  Olive
Alice     0    +4     0
Bob       0     0    +2
Carol    -1     0    -2
Dave     +3     0     0
• Things are associated, like people to colors
• Associations have strengths, like preferences and dislikes
• Can quantify associations: Alice loves navy = +4, Carol dislikes olive = -2
• We don’t know all associations: many implicit zeroes
Source: Sean Owen (2012), Cloudera
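One way to picture the "implicit zeroes" is to store only the known associations and fill in zero for everything else. The sketch below uses the people, colors, and strengths from the slide; the names `known`, `strength`, and `to_dense` are made up for illustration.

```python
people = ["Alice", "Bob", "Carol", "Dave"]
colors = ["Rose", "Navy", "Olive"]

# Only the associations we actually observed (sparse representation).
known = {
    ("Alice", "Navy"): 4,
    ("Bob", "Olive"): 2,
    ("Carol", "Rose"): -1,
    ("Carol", "Olive"): -2,
    ("Dave", "Rose"): 3,
}


def strength(person: str, color: str) -> int:
    """Return the known strength, or an implicit zero when nothing is known."""
    return known.get((person, color), 0)


def to_dense() -> list:
    """Expand the sparse associations into the full people-by-colors matrix."""
    return [[strength(p, c) for c in colors] for p in people]


dense = to_dense()
for person, row in zip(people, dense):
    print(person, row)
```

A zero here means "unknown", not "neutral": five of the twelve cells are observed and the other seven are placeholders.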
30. In Terms of Few Features
• Can explain associations by appealing to underlying features in common (e.g. “blue-ness”)
• Relatively few features (one “blue-ness”, but many shades)
(Alice) -> (Blue) -> (Navy)
Source: Sean Owen (2012), Cloudera
31. Losing Information is Helpful
• When k (= number of features) is small, information is lost
• Factorization is approximate (Alice appears to like blue-ish periwinkle too)
(Alice) -> (Blue) -> (Navy), (Periwinkle)
Source: Sean Owen (2012), Cloudera
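The periwinkle effect can be reproduced with a toy example. The sketch below is not the factorization method used in the source deck: it computes a rank-1 approximation (k = 1, a single "blue-ness" feature) by plain power iteration, and it treats unknown associations as zeros. Alice's navy score of +4 comes from the slides; the second user "Erin" and her scores are hypothetical.

```python
def rank_one_approximation(matrix, iterations=100):
    """Approximate the matrix with one feature via power iteration on M^T M."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # initial guess for the item-side feature vector
    for _ in range(iterations):
        # u = M v: how strongly each person expresses the feature
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = M^T u: how strongly each item expresses the feature
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    # Reconstruct every cell from the single shared feature.
    return [[u_i * v_j for v_j in v] for u_i in u]


colors = ["Navy", "Periwinkle"]
ratings = [
    [4.0, 0.0],  # Alice: navy +4 (from the slides), periwinkle unknown
    [3.0, 3.0],  # Erin (hypothetical): likes both blue-ish shades
]

approx = rank_one_approximation(ratings)
alice_periwinkle = approx[0][1]
print(f"Alice's estimated periwinkle score: {alice_periwinkle:.2f}")
```

With only one feature the model cannot tell navy from periwinkle exactly, so Alice's observed liking for navy leaks into a positive estimate for periwinkle. That lost detail is what turns the approximation into a recommendation.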
46. Range
A company that values only quantitative models will lack the ability to find potential growth opportunities: "The biggest problem with the pursuit of quantification is that it ignores the context in which people's behavior arises, detaches events from their situations, and overlooks the effect of variables that were not included in the model."
- Roger Martin, Rotman School of Management, Toronto
49. 4R: Reach, Richness, Representation, Range
• Reach: user reach (DAU), Low to High
• Richness: data richness (behavioral data), Low to High
• Range: system scope (affiliate of whole context), High
• Representation: presentation form and content (format & content)