Introduction to Big Data

Prof. Pan, Ren-Hao (潘人豪)
Yuan Ze University (元智大學), Center for Big Data and Digital Convergence (大數據與數位匯流中心)

The Origins of Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile or wearable devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
 Progress and innovation are no longer hindered by the ability to collect data
 but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx

1-Scale (Volume)
 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes (ZB) to 35 ZB
 Data volume is increasing exponentially
Characteristics of Big Data
[Figure: exponential increase in collected/generated data, with scale marks at 10^12, 10^15, 10^18, and 10^21]
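
As a quick check on the volume figures above, the following sketch derives the overall growth multiple and the compound annual growth rate it implies. It assumes the two endpoints given on the slide (0.8 ZB in 2009, 35 ZB in 2020) and steady year-over-year growth, which is a simplification.

```python
# Endpoints taken from the slide; steady compound growth is an assumption.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009  # 11 yearly steps

growth_factor = end_zb / start_zb               # overall multiple, close to 44x
annual_rate = growth_factor ** (1 / years) - 1  # compound annual growth rate

print(f"Overall growth: {growth_factor:.2f}x")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under these assumptions the data volume grows by roughly 41% per year, which is what "exponential" means in practice here.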

2-Complexity (Variety)
 Various formats, types, structures (or
unstructured ones).
 Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dimensional arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many types of
data
To extract knowledge, all these types of
data need to be linked together

3-Speed (Velocity)
 Data is being generated fast and needs to be processed fast
 Online Real-time Data Analytics
 Late decisions → missed opportunities
 Examples
 e-Promotions: based on your current location, your purchase history, and what
you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body →
any abnormal measurements require immediate reaction
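
The healthcare-monitoring example can be sketched as a small streaming check that reacts to each reading as it arrives. This is only an illustration: the function name `stream_alerts`, the window size, the threshold, and the heart-rate values are all hypothetical and not from the slides.

```python
from collections import deque
from statistics import mean, stdev

def stream_alerts(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent window's mean."""
    recent = deque(maxlen=window)  # baseline of the most recent normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the abnormal value out of the baseline
        recent.append(value)

# Made-up heart-rate stream with one abnormal spike.
heart_rate = [72, 74, 73, 71, 75, 74, 140, 73, 72]
alerts = list(stream_alerts(heart_rate))
print(alerts)
```

Because the generator processes one reading at a time and keeps only a fixed-size window, the same logic applies to an unbounded stream, not just a stored list.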

Big Data: 3V’s

Some Make it 4V’s

Application Types - Data at Rest

Application Types - Data in Motion

Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time

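Case: Google successfully predicted the spread of H1N1 across the United States
• Several weeks before the 2009 H1N1 outbreak in the United States, Google successfully predicted how H1N1 would spread
nationwide, down to individual states and specific regions, and did so in a very timely manner.
• The Centers for Disease Control and Prevention (CDC) could usually do this only one to two weeks after a flu outbreak.
• This was the first real attempt to use search-engine big data to predict disease for disease control.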
From: http://blog.sciencenet.cn/blog-291824-644684.html
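Big Data Applications
Method:
 Google found that the number of users searching for flu-related topics was
closely related to the number of people actually showing flu symptoms.
Google compared query counts with data from traditional flu surveillance systems
and found that certain search keywords were especially popular during flu season.
 Therefore, simply by counting how often users searched for these keywords,
it could estimate how flu outbreaks were developing in countries and regions around the world. Google's findings were also published in the journal Nature.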
http://www.google.org/flutrends/intl/en_us/
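
The idea behind this method (query counts tracking reported cases) can be sketched with a correlation and a simple linear fit. This is not Google's actual model; the weekly numbers below are invented for illustration, and `pearson` and `fit_line` are hypothetical helper names.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Made-up weekly data: flu-related query counts and reported flu cases.
queries = [120, 150, 210, 320, 400, 380, 260]
cases = [30, 38, 55, 82, 101, 97, 64]

r = pearson(queries, cases)
slope, intercept = fit_line(queries, cases)
estimate = slope * 300 + intercept  # estimated cases for a week with 300 queries
print(f"r = {r:.3f}, estimated cases at 300 queries = {estimate:.1f}")
```

A high correlation on past data does not guarantee that the relationship holds later, so an estimate like this still needs checking against surveillance data.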

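Baidu Disease Prediction
 Baidu's own data (search, Weibo, Tieba) is combined with influenza surveillance data from the
Chinese Center for Disease Control and Prevention (CDC) to build a prediction model.
 Compared against the influenza positive rate provided by the CDC (value for 2014-05-25), 62% of cities
had an absolute error within 1%, and 89% of cities had an absolute error within 5%.
 For other diseases, Baidu relies on its own search data and uses unsupervised learning models to
predict how trending disease searches change over space and time.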
http://trends.baidu.com/disease/
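
The accuracy figures above are shares of cities whose absolute error falls within a tolerance. A minimal sketch of that calculation follows; the per-city errors are made up and `share_within` is a hypothetical helper, so the printed shares do not reproduce the 62% and 89% on the slide.

```python
def share_within(errors, tolerance):
    """Fraction of cities whose absolute error is at most `tolerance`."""
    return sum(1 for e in errors if abs(e) <= tolerance) / len(errors)

# Made-up per-city errors (predicted minus reported positive rate, in percentage points).
city_errors = [0.4, -0.8, 2.5, 0.9, -4.1, 6.3, 0.2, 1.7]

within_1 = share_within(city_errors, 1.0)
within_5 = share_within(city_errors, 5.0)
print(f"within 1 point: {within_1:.0%}, within 5 points: {within_5:.0%}")
```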

“Big data hubris,” or just nitpicking?
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014.
“The Parable Of Google Flu: Traps In Big Data Analysis.” Science 343 (14 March)

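The Role of Social Media in Big Data
 Beyond search engines predicting outbreaks, social media such as Twitter and
Facebook (FB) are also gradually finding their place in this big data race.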

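Social Networks in Healthcare
 The University of California, Los Angeles (UCLA) used the volume and location of Twitter messages
to track the spread of sexually transmitted diseases and drug abuse behavior.
 The University of California team collected 550 million tweets, used an algorithm
to filter out those containing words such as "sex" and "get high",
recorded the regions where they were posted, and then used a statistical model to check
whether new HIV cases were reported in those regions.
 The results showed a significant relationship between the two: regions whose tweets
showed a high "sex index" also had more new HIV infections.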
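
The filtering step described above (find tweets containing certain keywords, then aggregate by region) can be sketched as follows. This is a simplified illustration, not the UCLA pipeline: the tweets are invented, `keyword_rate_by_region` is a hypothetical name, and plain substring matching is used, which would also match unrelated words that happen to contain a keyword.

```python
from collections import Counter

KEYWORDS = ("sex", "get high")  # the two example terms named on the slide

def keyword_rate_by_region(tweets, keywords=KEYWORDS):
    """Return, per region, the share of tweets containing any keyword."""
    totals, hits = Counter(), Counter()
    for region, text in tweets:
        totals[region] += 1
        lowered = text.lower()
        if any(k in lowered for k in keywords):
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}

# Made-up (region, text) pairs.
tweets = [
    ("A", "Let's get high tonight"),
    ("A", "Nice weather today"),
    ("B", "Traffic is bad"),
    ("B", "Coffee time"),
]
rates = keyword_rate_by_region(tweets)
print(rates)
```

The per-region rates would then be compared with reported case counts in a statistical model, which is outside the scope of this sketch.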

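Combining Search Engines and Social Networks
 Combining Google Search with Twitter can also reveal shifts in social attitudes
quite precisely.
 Two American economists combined information from both and found that when the American TV series
"16 and Pregnant" and "Teen Mom" aired, the rate of teen pregnancies and births
dropped substantially.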

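Intel is working with the Michael J. Fox Foundation, which focuses on Parkinson's disease research, on a study
to find disease patterns in data collected from patients' wearable devices.
Five million people worldwide have been diagnosed with Parkinson's disease, the second most common neurodegenerative disease.
With wearable devices, researchers can monitor patients remotely, so people living in remote areas can also take part.
Such devices help enable large-scale clinical trials; today many Parkinson's patients cannot join clinical trials
because there is no suitable medical institution nearby.
Compared with patients' subjective descriptions, data recorded by wearable devices is also more objective. For example, a patient might tell
the doctor that a tremor lasted several minutes when it actually lasted only a few seconds.
Intel: Tackling Parkinson's with Big Data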
http://www.36dsj.com/archives/11605

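The Rise of Wearable Technology
 An aging society and advances in medicine have increased attention to health and to
the human body, the most sophisticated and complex system of all.
 The German health and wellness brand beurer launched a watch with built-in heart-rate detection,
breaking the notion that checking one's physical condition requires going to a specific place
and wearing complicated instruments, and building the function into an everyday item so that
technology, health, and daily life come together.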

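Flatiron Health, a healthcare technology company based in New York, was founded just two years ago.
It recently received investment from Google Ventures.
There are more than 13 million cancer patients in the United States, yet researchers and doctors can study only a portion of them.
In the United States, the vast majority of cancer treatment experience comes from clinical trials, yet as many as 96% of patients do not take part in them.
Information on that other 96% of patients sits in electronic medical record (EMR) systems and doctors' notes. The goal is to collect the data of these 96%
of patients and reorganize it so that doctors, patients, and other stakeholders can use it.
Flatiron Health, a U.S. healthcare technology company: Beating Cancer with Big Data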
http://www.36dsj.com/archives/9319
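Skip Snow, senior analyst at Forrester Research, said:
“What Google wants is immortality. They firmly believe
that their involvement in healthcare is about
pursuing longevity: how can we help people
live longer and healthier lives?”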

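From: http://tieba.baidu.com/p/2900201015
Microsoft Big Data Successfully Predicts the Oscars: 21 of 24
David Rothschild, an economist at Microsoft Research New York,
used big data analysis to correctly predict 21 of the
24 award categories at the 2014 Oscars.
In 2013, David Rothschild's predictions of the Oscar winners
were correct in 19 of 24 categories.
Main inputs:
 Non-statistical data such as box-office revenue and film awards.
 Information from prediction market websites.
 User-generated data: in-depth discussions of the nominated films
by users across social media.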
http://www.360doc.com/content/13/0227/22/184879_268325152.shtml

Fashion trends among consumers often
change in the blink of an eye
 Philosophy of Zara
 In the apparel industry, Zara stresses the need to react rather
than predict.
 Developed a business model where speed and decentralized
decision-making were essential.
 Zara’s Fast Fashion
 Understanding the items that its customers actually want.

Strategies of Zara
 Vertical Integration
 Small Batch Production
 Collecting Vital Information for Decision Making
 Best-selling items: type of fabric, cut, and colors
 Quick response to Demand (Pull System/Message Sharing)
 Analyze “Regional Pop”
 Make the market segmentation closest to the customer needs.
 High Product Turnover
 Strong IT System
 Real-time knowledge (dataflow) in the entire distribution-to-sale process

Product
• Quick Change Artist
Production
• Supply Chain Management
Logistics
• Inventory Workflow Innovation
Selling
• Real-time
Customer
Service
• Online Shop

Data Flow in Quick Change Artist

Inventory Workflow Innovation
 High-velocity shipping: Rapid Information flows
 Stores: Electronically connected to headquarters
 Logistics system: Speed and flexibility
 Products:
 Selected
 Sorted
 Routed
 Delivered
 Local distribution center
 Retail store stockrooms

Customer Service
O2O: Offline to Online, and vice versa

Degree of System Implementation at Zara

Zara Online Shop
 Collect feedback for manufacturing
 Identify the target market precisely
 Hold consumer opinion surveys and
capture customer feedback to
improve the products actually
shipped

Zara vs. Big Data
Information Integration, Focus on customer requirement, Decentralized decision-making
In-store Online Shop
Customer Behavior
PoS
Click Tracking
Online Forum
Consumer survey
DATA
Daily Report
High-velocity shipping
Prototype Survey
Real-Time Data
Fashion Analysis
market segmentation
Quick Change Artist
Agile Management

Analytical Culture in Zara
Online Retail Website KPIs
 Purchase conversion
 Average order size
 Items per order
 Purchase dropout rate
 Effect on offline sales
 Returned items rate
Company Marketing KPIs
 Response rate by segment
 Response rate by marketing medium
 Response rate by marketing message
 Cost per marketing campaign / cost per sale
 Revenue per marketing campaign / revenue per sale
Company Strategic KPIs
 Ratio of winning designs
 Ratio of cross-brand conversions (in INDITEX retail group)
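
A few of the online retail KPIs listed above can be computed directly from order records. The sketch below uses common definitions (for example, purchase conversion as orders divided by visits), which are assumptions rather than Zara's own formulas; the function `retail_kpis`, the field names, and the sample data are hypothetical.

```python
def retail_kpis(visits, orders):
    """Compute basic online-shop KPIs from a visit count and a list of orders.

    Each order is a dict with 'revenue', 'items', and 'returned_items' keys.
    """
    n_orders = len(orders)
    total_items = sum(o["items"] for o in orders)
    return {
        "purchase_conversion": n_orders / visits,
        "average_order_size": sum(o["revenue"] for o in orders) / n_orders,
        "items_per_order": total_items / n_orders,
        "returned_items_rate": sum(o["returned_items"] for o in orders) / total_items,
    }

# Made-up data: 200 site visits that produced four orders.
orders = [
    {"revenue": 60.0, "items": 2, "returned_items": 0},
    {"revenue": 90.0, "items": 3, "returned_items": 1},
    {"revenue": 30.0, "items": 1, "returned_items": 0},
    {"revenue": 20.0, "items": 2, "returned_items": 1},
]
kpis = retail_kpis(200, orders)
print(kpis)
```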

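Jack Ma's Judgment Comes from Data Analysis
“In early 2008, the number of buyer inquiries across the Alibaba platform dropped
sharply; purchasing from China by Europe and the United States was falling. Customs
only gets its data after the goods have been sold and shipped, whereas we inferred
from inquiries, half a year ahead, that world trade was changing.”
Jack Ma's forecasts about the future are built on the analysis of
user behavior.
Case: Jack Ma successfully predicted the 2008 economic crisis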
http://tech.sina.com.cn/i/2008-12-08/01422631744.shtml http://www.taoguba.com.cn/Article/797119/1
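Taobao Index (淘寶指數) is a data research platform
on Chinese consumers.
It can be used to understand trending searches on Taobao,
look up transaction trends, identify consumer groups, and study
market segments.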
http://shu.taobao.com/

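 In December 2013, Amazon applied for a patent on "anticipatory shipping"
 Uses big data to anticipate users' purchasing behavior
 Ships these items out of the warehouse in advance and holds them at a shipping hub
 Once the user actually places the order, loads them onto a truck immediately for delivery to the user's home
 There is only one goal: drastically cut delivery time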
From: http://www.tnc.com.cn/info/c-013005-d-3426672-p1.html
http://www.ebrun.com/20140118/90140.shtml
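What if the prediction is wrong?
Amazon would consider offering the user a discounted price,
much like a promotion, or simply giving the item away
for free as a goodwill gift.
This patent has not yet been put into practice.
Amazon CEO Jeff Bezos
Inputs considered:
 Previous orders
 Product search history
 Wish lists
 Shopping cart
 How long the user's mouse cursor hovers over an item.
The Power of Amazon's Big Data: Goods on the Way Before You Order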

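From: http://www.chinabidding.com/zxzx-detail-222667502.html
Vestas Wind Systems of Denmark improved wind power generation efficiency by deploying an IBM big data
solution on one of the world's largest supercomputers. Analyses that used to take weeks can now be completed
in less than an hour.
IBM in wind farm operations and maintenance management:
 Wind power forecasting
 Wind farm micro-siting
 Preventive maintenance
 Performance evaluation
 Full life-cycle management and optimization of wind farms.
IBM: Big Data Analytics for Wind Power Operations and Maintenance
Data:
 Petabyte-scale weather reports
 Tidal phases
 Geospatial data
 Satellite imagery, etc.
Massive volumes of structured and unstructured data are analyzed to
optimize wind turbine placement and improve wind power generation efficiency.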

http://www.itongji.cn/article/02251H22013.html
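Boston and LA: Cities Use Big Data to Help Police Fight Crime
 The University of Michigan released a report
detailing a method that uses "supercomputers and large amounts of data"
to help police pinpoint the areas most vulnerable
to criminals.
 The researchers used an extremely large amount of data, aiming to
create a hot-spot map of high-crime areas in Boston.
 As more and more data is added to the study, the researchers
believe they can reach more accurate conclusions about how additional
variables affect the crime rate.
Data sources:
 Demographic data
 Drug crime data
 Types of alcohol sold in each area
 Various factors in neighboring districts
 ...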

National Police Agency M-Police Integrated Query System

National Police Agency M-Police Facial Recognition

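Legal Authorization Issues
 Police retrieve personal data as needed for crime investigation and prevention
 Under the "Directions for the Use and Management of National Identification Card Photo Image Data of the National Police Agency, Ministry of the Interior" (內政部警政署國民身分證相片影像資料使用管理要點),
ID card photos may lawfully be retrieved, but there is no legal authorization for passport photos
 Police must photograph a person before operating the facial recognition system
 Under the Police Power Exercise Act (警察職權行使法), police may photograph for evidence only at assemblies, parades, or other public
events where participants are reasonably believed likely to endanger public safety or order; there is no legal authorization
for police to photograph specific individuals in civil matters or other criminal matters
 The government has opened the database of ID card and passport photos of all 23 million citizens for police to query
freely online according to operational needs
 Under Judicial Yuan Interpretation No. 603 (on fingerprinting as a condition for receiving an ID card), the government's collection of
the ID card and passport photo database of all 23 million citizens for police online queries in crime investigation should be authorized
by a dedicated statute; relying only on the provisions of Articles 5 and 15 of the Personal Data Protection Act on legitimate and reasonable connection in data use
violates the principle of proportionality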

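 GE plans to invest heavily in its "Industrial Internet" initiative.
 The sensors in GE's aircraft engines have operated in passive mode: a red light comes on in the instrument panel only when a fault occurs.
There are many such sensors, measuring for example temperature, pressure, and voltage, and in the past this sensor data was rarely retained or
studied. On most flights, the engine kept only three averages: for takeoff, cruise, and landing.
 According to Varma, GE's next-generation GEnx engine (which powers the Boeing 787) will retain all the underlying data
from every flight (about 1 terabyte) and will even transmit it from the aircraft back to GE in real time for analysis.
 The main task of GE's software R&D center in the western United States is the software and hardware for the Industrial Internet.
Case: GE, Sensors + Big Data to Build the Industrial Internet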
http://www.ctocio.com/ccnews/9954.html
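
The contrast between keeping three per-phase averages and keeping every sample can be illustrated with a small sketch. The temperature readings are invented and `phase_averages` is a hypothetical helper; the point is only that an average can hide a short spike that the full data reveals.

```python
from collections import defaultdict
from statistics import mean

def phase_averages(samples):
    """Reduce (phase, value) samples to one average per flight phase."""
    by_phase = defaultdict(list)
    for phase, value in samples:
        by_phase[phase].append(value)
    return {phase: mean(values) for phase, values in by_phase.items()}

# Made-up engine temperature samples, including one brief spike during cruise.
samples = [
    ("takeoff", 900), ("takeoff", 940),
    ("cruise", 700), ("cruise", 710), ("cruise", 1150), ("cruise", 690),
    ("landing", 600), ("landing", 620),
]
averages = phase_averages(samples)
peak = max(value for _, value in samples)
print(averages, peak)
```

Here the cruise average looks unremarkable while the retained samples show a peak well above it, which is the kind of signal that is lost when only averages are stored.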

Case : Big Data in Education
http://www-01.ibm.com/software/analytics/education/resources.html
MOOCs
 Huge potential from Big Data perspective.
 Learning portfolio for everyone?
 Teaching tailored to each student (self-directed and adaptive learning).
 Human resources (HR development).
IBM
 Collects academic, disciplinary and attendance data from school districts.
 Analyzes over 150 key metrics, and presents information in reports and dashboards.
 Develops early warnings to alert teachers and counselors to at-risk students before
they drop out. Up to 25% reduction in dropout rate.
Problem: In U.S. high schools, the dropout rate is over 30%.
In Mobile County, Alabama, it stood at 48%, translating into roughly 2,500
youths.
How to reduce the annual dropout rate?
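
An early-warning check over the three kinds of data mentioned above (academic, disciplinary, attendance) could look like the rule-based sketch below. It is not IBM's system: the function `risk_flags`, the field names, and all thresholds are hypothetical and chosen only for illustration.

```python
def risk_flags(student, max_absences=10, min_gpa=2.0, max_incidents=3):
    """Return the list of warning categories a student record triggers."""
    flags = []
    if student["absences"] > max_absences:
        flags.append("attendance")
    if student["gpa"] < min_gpa:
        flags.append("academic")
    if student["incidents"] > max_incidents:
        flags.append("disciplinary")
    return flags

# Made-up student records.
students = [
    {"name": "S1", "absences": 2, "gpa": 3.4, "incidents": 0},
    {"name": "S2", "absences": 14, "gpa": 1.8, "incidents": 1},
]
at_risk = {s["name"]: risk_flags(s) for s in students if risk_flags(s)}
print(at_risk)
```

A production system would likely combine many more metrics and learn thresholds from historical outcomes rather than fixing them by hand.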

http://www.utsystem.edu/seekut/Terms.htm

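Taking Baidu as an example:
It searched data from 37,000 matches played by 987 teams
worldwide over the past five years, involving
19,972 players and 112 million related data records.
From: http://www.huxiu.com/article/37708/1.html
Case 18: 2014 World Cup, the German Football Association and SAP Co-develop Match Insights
In predicting the 16 knockout-stage matches of the 2014 World Cup, Google,
Microsoft, and Baidu successfully predicted the World Cup round of 16.
The German Football Association and SAP co-developed the Match Insights
system, which uses pitch-side cameras to collect data and
its secret weapon, HANA, to analyze big data in real time,
letting coaches keep track of players on both sides and plan
pre-match training and in-game tactics.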
http://it.people.com.cn/BIG5/n/2014/0709/c1009-25257386.html

http://www.bigdatalandscape.com/

Big Data Challenges
1. Meeting the need for speed
 In today’s hypercompetitive business environment,
companies not only have to find and analyze the relevant
data they need, they must find it quickly. The challenge is
handling the sheer volumes of data and accessing the level
of detail needed, all at high speed.
2. Understanding the data
 It takes a lot of understanding to get data in the right shape.

Big Data Challenges (cont.)
3. Addressing data quality
 The value of data for decision-making purposes will be
jeopardized if the data is not accurate or timely.
4. Displaying meaningful results
 Presenting analysis results becomes difficult when dealing with
extremely large amounts of information or a variety of
categories of information.
5. Dealing with outliers
 Outliers may not be representative of the data, but they may also
reveal previously unseen and potentially valuable insights.
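
One common way to surface outliers for closer inspection, rather than silently dropping them, is the interquartile-range (IQR) rule. The sketch below is a generic illustration with invented values; the function name `iqr_outliers` and the conventional multiplier of 1.5 are assumptions, not something prescribed by the slides.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up measurements with one value far from the rest.
measurements = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(measurements)
print(outliers)
```

Whether a flagged value is noise to be cleaned or a valuable signal still needs a human or domain-specific decision.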

The Future of Big Data
 Stop talking about how the quality of data matters less.
We are only starting to get to a point where we can truly
focus on the quality of big data.
 Big data must be effectively stored, transferred,
transformed and analyzed without threatening the original
data.
Bigger, Better, Faster, Stronger
