Prof. 潘人豪 Pan, Ren-Hao
Yuan Ze University • Innovation Center for Big Data and Digital Convergence
Web Evolution
The Origins of Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile or wearable devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
The Origins of Big Data
1-Scale (Volume)
• Data Volume
  • 44x increase from 2009 to 2020
  • From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
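As a quick check on the volume figures quoted on this slide (0.8 ZB in 2009, 35 ZB in 2020), the sketch below computes the overall growth multiple and the compound annual growth rate it implies. The variable names are illustrative only, and it assumes smooth year-over-year compounding, which the slide does not claim.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009  # 11 yearly steps

# Overall multiple over the whole period (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, assuming smooth compounding.
cagr = growth_factor ** (1 / years) - 1

print(f"Overall growth: {growth_factor:.2f}x")
print(f"Implied annual growth: {cagr:.1%}")
```

Under that assumption the data volume grows by roughly 41% per year, which is what "exponential" means in practice here.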
Characteristics of Big Data
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
Scale: 10^12 (tera), 10^15 (peta), 10^18 (exa), 10^21 (zetta) bytes
2-Complexity (Variety)
 Various formats, types, structures (or
unstructured ones).
 Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dimensional arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many types of
data
To extract knowledge all these types of
data need to be linked together
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Real-time Data Analytics
• Late decisions → missed opportunities
• Examples
  • e-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
Big Data: 3V’s
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
Some Make it 4V’s
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
Application Types - Data at Rest
Application Types - Data in Motion
Application Types - Streaming
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
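To make the OLTP versus OLAP distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its rows are invented for illustration. A single in-memory SQLite database stands in for both workloads, whereas real deployments would typically use separate systems. RTAP is not covered by this sketch.

```python
import sqlite3

# Hypothetical in-memory table, used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 20.0), ("north", 35.0), ("south", 15.0)],
)

# OLTP-style work: a short transaction that touches a single row by key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 5 WHERE id = ?", (1,))

# OLAP-style work: an aggregate that scans many rows to answer a question.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(totals)
```

The transactional statement is small and frequent. The analytical query reads across the whole table, which is why data warehouses are organized differently from transactional databases.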
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time analysis
Big Data Technology
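Case: Google successfully predicted the spread of H1N1 across the United States
• In 2009, several weeks before the H1N1 outbreak in the United States, Google successfully predicted how H1N1 would spread across the country, down to individual states and specific regions, and did so in a very timely manner.
• The Centers for Disease Control and Prevention (CDC) can usually do this only one to two weeks after a flu outbreak.
• This was the first real attempt to use search-engine big data to make forecasts for disease control.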
From: http://blog.sciencenet.cn/blog-291824-644684.html
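Big Data Applications
Method:
• Google found a close relationship between the number of users searching for flu-related topics and the number of people who actually had flu symptoms.
• Google compared query counts with data from traditional flu surveillance systems and found that certain search keywords were especially popular during flu season.
• Therefore, simply by counting how often users searched for these keywords, it could forecast flu trends in countries and regions around the world. Google's findings were also published in the journal Nature.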
http://www.google.org/flutrends/intl/en_us/
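The method on this slide rests on a correlation between search volume and reported cases. The sketch below computes a Pearson correlation coefficient for two short weekly series. The numbers are synthetic and made up for illustration; they are not Google or CDC data, and the real system selected among many candidate queries rather than using a single series.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic weekly counts (assumption: illustrative values only).
weekly_flu_searches = [120, 150, 310, 560, 480, 260]
weekly_reported_cases = [30, 42, 95, 170, 150, 80]

r = pearson(weekly_flu_searches, weekly_reported_cases)
print(f"Correlation between searches and cases: {r:.3f}")
```

A high coefficient on historical data is only a starting point. As the Lazer et al. paper cited later in the deck argues, such correlations can drift over time.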
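Baidu Disease Prediction
• Baidu's own data (search, microblogs, Tieba) was combined with influenza surveillance data from the Chinese Center for Disease Control and Prevention (CDC) to build a prediction model.
• Compared with the influenza positive rate provided by the CDC (value for 2014-05-25), 62% of cities had an absolute error within 1%, and 89% of cities within 5%.
• For other diseases, the service relies on Baidu's own search data and uses an unsupervised learning model to predict how trending disease searches change over time and space.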
http://trends.baidu.com/disease/
"Big data hubris," or just nitpicking?
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014.
“The Parable Of Google Flu: Traps In Big Data Analysis.” Science 343 (14 March)
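The Role of Social Media in Big Data
• Beyond search engines forecasting epidemics, social media such as Twitter and Facebook are also gradually finding their place in the big data race.
Social Networks in Healthcare
• The University of California, Los Angeles (UCLA) used the volume and location of Twitter messages to track the spread of sexually transmitted diseases and drug abuse.
• The researchers collected 550 million tweets, used an algorithm to filter for those containing words such as "sex" and "get high", recorded the regions where they were posted, and then used a statistical model to check whether new HIV cases were reported in those regions.
• They found a significant relationship between the two: regions whose tweets showed a high "sex index" also had more new HIV infections.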
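The filtering step described above can be sketched as a per-region keyword rate. This is only an illustration under simplifying assumptions: the function name, the tiny sample, and the plain substring matching are all hypothetical, and the actual study's algorithm and statistical model are not described on the slide.

```python
from collections import Counter

# The two phrases mentioned on the slide; a real study would use a larger list.
KEYWORDS = ("sex", "get high")


def keyword_rate_by_region(tweets):
    """Return, per region, the share of tweets containing any keyword.

    `tweets` is an iterable of (region, text) pairs.
    """
    totals = Counter()
    hits = Counter()
    for region, text in tweets:
        totals[region] += 1
        lowered = text.lower()
        if any(keyword in lowered for keyword in KEYWORDS):
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}


# Made-up sample data, for illustration only.
sample = [
    ("A", "Going to get high tonight"),
    ("A", "Traffic is awful"),
    ("B", "Nice weather"),
    ("B", "Lunch time"),
]
rates = keyword_rate_by_region(sample)
print(rates)
```

Substring matching is crude (it would also match unrelated words that contain a keyword), so a serious pipeline would tokenize the text first. The per-region rates would then be compared with reported case counts in a statistical model.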
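Combining Search Engines and Social Networks
• Combining Google search data with Twitter can also reveal shifts in social attitudes quite precisely.
• Two American economists combined the two sources and found that the teen birth rate dropped substantially when the American TV shows 16 and Pregnant and Teen Mom were on air.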
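Intel is working with the Michael J. Fox Foundation, which focuses on Parkinson's disease research, on a study that looks for disease patterns in data collected from patients' wearable devices.
About 5 million people worldwide have been diagnosed with Parkinson's disease, the second most common neurodegenerative disease.
With wearable devices, researchers can monitor patients remotely, and people living in remote areas can also take part.
Such devices support large-scale clinical trials; today many Parkinson's patients cannot join clinical trials because there is no suitable medical facility nearby.
Compared with patients' subjective descriptions, the data recorded by wearables is also more objective. For example, a patient may tell the doctor that a tremor lasted several minutes when it actually lasted only a few seconds.
Intel: Tackling Parkinson's with Big Data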
http://www.36dsj.com/archives/11605
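The Rise of Wearable Technology
• An aging society and medical progress are driving greater attention to health and to the human body, the most intricate and complex of systems.
• The German health and lifestyle brand Beurer launched a watch with built-in heart rate monitoring. It challenges the idea that checking your physical condition requires going to a specific place and wearing cumbersome equipment, and it brings technology, health, and daily life together in an everyday accessory.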
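Flatiron Health, a healthcare technology company based in New York, was founded just two years ago.
It recently received investment from Google Ventures.
There are more than 13 million cancer patients in the United States, yet researchers and doctors can study only a fraction of them.
In the United States, most knowledge about cancer treatment comes from clinical trials, but as many as 96% of patients do not take part in them.
The information on that other 96% of patients sits in electronic medical record (EMR) systems and doctors' notes. The goal is to collect the data on these 96% of patients and reorganize it so that doctors, patients, and other stakeholders can use it.
Flatiron Health, a US healthcare technology company: Beating Cancer with Big Data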
http://www.36dsj.com/archives/9319
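Skip Snow, a senior analyst at Forrester Research, said: "What Google wants is immortality. They firmly believe that they are getting into healthcare in order to pursue longevity: how do you help people live longer and healthier lives?"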
From:http://tieba.baidu.com/p/2900201015
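Microsoft Big Data Successfully Predicts the Oscars: 21 out of 24
David Rothschild, an economist at Microsoft Research New York, used big data analysis to correctly predict 21 of the 24 Academy Award categories in 2014.
In 2013, David Rothschild's predictions of the Oscar winners were correct in 19 of 24 categories.
Main inputs:
• Non-statistical data such as box office revenue and film awards.
• Information from prediction market websites.
• User-generated data: in-depth discussion of the nominated films by users on social media.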
http://www.360doc.com/content/13/0227/22/184879_268325152.shtml
Fashion trends among consumers often
change in the blink of an eye
• Philosophy of Zara
  • The apparel industry needs to react rather than predict.
  • Developed a business model in which speed and decentralized decision-making are essential.
 Zara’s Fast Fashion
 Understanding the items that its customers actually want.
Strategies of Zara
• Vertical Integration
• Small Batch Production
• Collecting Vital Information for Decision Making
  • Best-selling items: type of fabric, cut, and colors
• Quick Response to Demand (Pull System/Message Sharing)
  • Analyze "Regional Pop"
  • Make the market segmentation match customer needs as closely as possible.
 High Product Turnover
 Strong IT System
 Real-time Knowledge(Dataflow) in the entire distribution-to-sale process
Product: Quick Change Artist
Production: Supply Chain Management
Logistics: Inventory Workflow Innovation
Selling: Real-time Customer Service, Online Shop
Data Flow in Quick Change Artist: Data Communication
Inventory Workflow Innovation
 High-velocity shipping: Rapid Information flows
 Stores: Electronically connected to headquarters
 Logistics system: Speed and flexibility
 Products:
 Selected
 Sorted
 Routed
 Delivered
 Local distribution center
 Retail store stockrooms
Customer Service
O2O: Offline to Online, and vice versa
Customer Service
Degree of System Implementation at Zara
Zara Online Shop
• Collect feedback for manufacturing
• Identify the target market precisely
• Hold consumer opinion surveys and capture customer feedback to improve the products actually shipped
vs. Big Data
Information Integration, Focus on customer requirement, Decentralized decision-making
In-store Online Shop
Customer Behavior
PoS
Click Tracking
Online Forum
Consumer survey
DATA
Daily Report
High-velocity
shipping
Prototype Survey
Real-Time Data
Fashion Analysis
market segmentation
Quick Change Artist
Agile Management
Analytical Culture in Zara
Online Retail Websites KPIs
• Purchase conversion
• Average Order Size
• Items per Order
• Purchase dropouts rate
• Effect on offline sales
• Returned items rate
Company Marketing KPIs
• Response rate by segment
• Response rate by the marketing media
• Response rate by marketing message
• Cost per marketing campaign/cost per sale
• Revenue per marketing campaign/revenue per sale
Company Strategic KPIs
• Ratio of winning designs
• Ratio of cross-brand conversions (in INDITEX retail group)
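Jack Ma's judgment came from data analysis
"In early 2008, the number of buyer inquiries across the Alibaba platform dropped sharply; purchasing from China by Europe and the United States was declining. Customs only gets its data after the goods have been sold and shipped, whereas we inferred from the inquiries, half a year in advance, that world trade was changing."
Jack Ma's predictions about the future are built on analysis of user behavior.
Case: Jack Ma successfully predicted the 2008 economic crisis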
http://tech.sina.com.cn/i/2008-12-08/01422631744.shtml http://www.taoguba.com.cn/Article/797119/1
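Taobao Index is a research platform for Chinese consumer data.
It is used to understand trending searches on Taobao, look up transaction trends, identify consumer groups, and study market segments.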
http://shu.taobao.com/
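• In December 2013, a patent application for "anticipatory shipping" was filed
• Use big data to predict what users will buy
• Ship those items out of the warehouse in advance and hold them at a shipping hub
• Once the user actually places the order, load them onto a truck right away and deliver them to the user's home
• There is only one goal: drastically cut delivery time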
From:http://www.tnc.com.cn/info/c-013005-d-3426672-p1.html
http://www.ebrun.com/20140118/90140.shtml
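What if the prediction is wrong?
Amazon would consider offering the customer a discounted price, much like a promotion, or simply building goodwill by giving the item away as a free gift.
This patent has not yet been put into practice
Amazon CEO Jeff Bezos
Inputs considered:
• Previous orders
• Product search history
• Wish lists
• Shopping cart contents
• How long the user's mouse cursor hovers over an item.
The Power of Amazon's Big Data: Goods on the Way Before You Order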
From:http://www.chinabidding.com/zxzx-detail-222667502.html
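Vestas Wind Systems of Denmark deployed an IBM big data solution on one of the world's largest supercomputers to improve wind power generation efficiency. Analyses that used to take weeks can now be completed in under an hour.
IBM in wind farm operations and maintenance management:
• Wind power forecasting
• Wind farm micro-siting
• Preventive maintenance
• Performance evaluation
• Management and optimization across the full wind farm life cycle.
IBM: Big Data Analytics Supports Wind Power Operations and Maintenance
Data:
• Petabyte-scale weather reports
• Tidal phases
• Geospatial data
• Satellite imagery, and more
This massive volume of structured and unstructured data is used to optimize wind turbine placement and improve wind power generation efficiency.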
http://www.itongji.cn/article/02251H22013.html
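Boston, LA: Cities Use Big Data to Help Police Fight Crime
• The University of Michigan published a report detailing a method that uses "supercomputers and large amounts of data" to help police locate the areas most vulnerable to criminal activity.
• The researchers used a very large amount of data with the aim of building a hotspot map of high-crime areas in Boston.
• As more and more data is added to the study, the researchers believe they can reach more accurate conclusions about how additional variables affect the crime rate.
Data sources:
• Demographic data
• Drug crime data
• Types of alcohol sold in each area
• Various factors in neighboring districts
• ...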
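National Police Agency M-Police Integrated Query System
National Police Agency M-Police Facial Recognition
Legal Authorization Issues
• Police retrieve personal data for crime investigation and prevention needs
  • Under the "National Police Agency, Ministry of the Interior, Directions for the Use and Management of National ID Card Photo Image Data", ID card photos may lawfully be retrieved, but there is no legal authorization for passport photos
• Police must photograph a person before using the facial recognition system
  • Under the Police Power Exercise Act, police may film to collect evidence only at assemblies, parades, or other public events where participants are reasonably believed likely to endanger public safety or order; there is no legal authorization for police to photograph specific individuals in civil or other criminal contexts
• The government has opened up the database of ID card and passport photos of all 23 million citizens for police to query freely online according to operational needs
  • In light of Judicial Yuan Interpretation No. 603 (on fingerprinting for ID card issuance), the government's collection of ID card and passport photos of all 23 million citizens for police online access in crime investigation should be authorized by a dedicated statute; relying only on the legitimate and reasonable connection provisions of Articles 5 and 15 of the Personal Data Protection Act violates the principle of proportionality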
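• GE plans to invest heavily in its "Industrial Internet" project.
• The sensors in GE's aircraft engines operate passively: they only light a red warning on the instrument panel once a fault occurs. There are many such sensors, measuring for example temperature, pressure, and voltage, and in the past this sensor data was rarely retained or studied. On most flights the engine kept only three averages: for takeoff, cruise, and landing.
• According to Varma, GE's next-generation GEnx engine (fitted to the Boeing 787) will retain all the raw data from every flight (about 1 terabyte) and will even transmit it from the aircraft to GE in real time for analysis.
• GE's software R&D center on the US West Coast is mainly tasked with software and hardware for the Industrial Internet.
Case: GE, Sensors + Big Data to Build the Industrial Internet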
http://www.ctocio.com/ccnews/9954.html
Case : Big Data in Education
http://www-01.ibm.com/software/analytics/education/resources.html
MOOCs
• Huge potential from a Big Data perspective.
• Learning portfolio for everyone?
• Teaching tailored to the individual (self-directed and adaptive learning).
• Human resources (HR development).
IBM
• Collects academic, disciplinary, and attendance data from school districts.
• Analyzes over 150 key metrics and presents information in reports and dashboards.
• Develops early warnings to alert teachers and counselors to at-risk students before they drop out. Up to 25% reduction in dropout rate.
Problem: In U.S. high schools, the dropout rate is over 30%.
In Mobile County, Alabama, it stood at 48%, translating into roughly 2,500 youths.
How to reduce the annual dropout rate?
http://www.utsystem.edu/seekut/Terms.htm
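Taking Baidu as an example:
It searched data on 37,000 matches played by 987 teams worldwide over the past five years, covering 19,972 players and 112 million related data points.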
From:http://www.huxiu.com/article/37708/1.html
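Case 18: 2014 World Cup, the German Football Association and SAP Co-develop Match Insights
In predictions for the 16 knockout matches of the 2014 World Cup, Google, Microsoft, and Baidu successfully predicted the World Cup round of 16.
The German Football Association and SAP co-developed the Match Insights system, which gathers data from pitch-side cameras and uses its secret weapon, HANA, for real-time big data analysis, so that coaches can track the condition of players on both sides and plan pre-match training and in-game tactics.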
http://it.people.com.cn/BIG5/n/2014/0709/c1009-25257386.html
http://www.bigdatalandscape.com/
Big Data Challenges
1. Meeting the need for speed
• In today's hypercompetitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly. The challenge is handling the sheer volume of data and accessing the level of detail needed, all at high speed.
2. Understanding the data
 It takes a lot of understanding to get data in the right shape.
Big Data Challenges (cont.)
3. Addressing data quality
 The value of data for decision-making purposes will be
jeopardized if the data is not accurate or timely.
4. Displaying meaningful results
• Representing analysis results becomes difficult when dealing with extremely large amounts of information or a variety of categories of information.
5. Dealing with outliers
• Outliers may not be representative of the data, but they may also reveal previously unseen and potentially valuable insights.
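One common way to surface outliers for review, rather than silently dropping them, is the interquartile range (IQR) rule. The sketch below is a generic illustration and is not taken from the SAS article the slide cites. The function name, the 1.5 multiplier, and the sample readings are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sensor readings with one obvious anomaly.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Because flagged points may carry the valuable insights the slide mentions, the function returns them for inspection instead of filtering them out.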
The Future of Big Data
• Stop talking about how the quality of data matters less. We are only starting to get to a point where we are truly able to focus on the quality of big data.
 Big data must be effectively stored, transferred,
transformed and analyzed without threatening the original
data.
Bigger, Better, Faster, Stronger
End of presentation. Comments welcome.

Introduction of big data

Editor's Notes

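  1. Research by International Data Corporation (IDC) shows that the world generated 0.49 ZB of data in 2008, 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011, equivalent to more than 200 GB of data for every person on Earth.
  2. ETL: Extract, Transform, and Load. ECL: Extensible Computer Language.
  3. http://tieba.baidu.com/p/2240499730 Statistics from January 2012 show that there were 13.7 million cancer survivors in the United States.
  4. How Zara uses Big Data, examined from four perspectives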
  5. Ratio of winning designs: given the number of new designs churned out by the company annually (about 11 thousand, versus two to four thousand at other major fashion brands), it is very important to cull the losers and focus on the winning designs. This can also help identify broader fashion trends in order to align the design team's efforts with customers' demands.
     Ratio of cross-brand conversions: Zara is a leading brand in the INDITEX retail group. By analyzing customers' purchases, behavior, and attitudes collected through multiple channels, the group can monetize existing customers to the fullest extent possible. For example, referral of Zara Home customers to Zara and vice versa could prove to be an efficient and effective way of monetizing existing customers due to the brands' recognition.
     Purchase conversion: the number of purchases over the total number of visits. This metric is especially helpful if it is tied to changes made to the site itself and to the product inventory.
     Average Order Size: the dollar/euro/other currency amount spent on each order. For most retailers, including Zara, it is an important metric, given that profit margins are often related to the monetary value of the purchase. The bigger the monetary value of the order, the lower the overhead.
     Items per Order: shows the effectiveness of cross-selling on the site. It can also be connected to the effectiveness of a specific promotional campaign. This helps to optimize recommendations to various groups of customers and fine-tune the marketing message.
     Purchase dropouts rate: the number of customers who abandoned the purchase over the total number of customers who started the purchasing process. This metric is extremely helpful in identifying which steps in the sales process deter customers from completing their transactions. The cause of a dropout could lie in the website design or in the disclosure of additional information to the customer. For example, customers may abandon the purchase once the shipping cost is added to the total cost. This information may give further insight into customers' price sensitivity or the timeliness of the order, and prompt Zara to optimize its shipping and handling operations.
     Effect on offline sales: this metric is usually difficult to measure but extremely important to capture. Not all online visits end in a purchase, but some of those visitors could be gathering additional information before visiting the store to make a purchase. This metric could be partially measured by offering some additional service or gift at the store with a code provided online.
     Returned items rate: the percentage of items purchased online that are returned. It is critical to receive and analyze customers' feedback on the reasons for the return. For example, this metric may indicate deficiencies in how the product is presented online, whether in the picture of the item or in its verbal description. It could help improve the quality and clarity of the copy.
     Response rate by segment: after customer segmentation is done, Zara can measure which segment responds better to a particular marketing campaign. Segmentation can be done by demographic, behavioral, attitudinal, geographic, and other criteria.
     Response rate by the marketing media: measures whether customers respond better to regular mail, email, website promotions, or in-store promotional offers.
     Response rate by marketing message: this can be measured in any of the marketing channels: online, in-store, or mailing campaign. However, online testing offers the most efficient way of testing the marketing message because the analysis of the data is available in near real time. After the message is tested online, it can be transferred to other marketing channels with a greater level of assurance about its effectiveness.
     Cost per marketing campaign/cost per sale: measures the effectiveness of a particular campaign in monetary terms.
     Revenue per marketing campaign/revenue per sale: measures the overall revenue generated by a particular campaign or by an average sale in the campaign.
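The ratio-style online retail KPIs defined in this note can be computed directly from a handful of counts. The sketch below follows the definitions given above. The function name, parameter names, and input figures are hypothetical, and "returned items rate" is taken here as returned items over items sold, which is one reasonable reading of the note.

```python
def online_retail_kpis(visits, checkouts_started, orders,
                       items_sold, revenue, items_returned):
    """Compute ratio KPIs from raw counts for one reporting period."""
    return {
        # Purchases over total visits.
        "purchase_conversion": orders / visits,
        # Currency amount spent per order.
        "average_order_size": revenue / orders,
        # Indicator of cross-selling effectiveness.
        "items_per_order": items_sold / orders,
        # Customers who abandoned over customers who started purchasing.
        "purchase_dropout_rate":
            (checkouts_started - orders) / checkouts_started,
        # Share of purchased items that came back (assumed definition).
        "returned_items_rate": items_returned / items_sold,
    }


# Invented figures for a single period, for illustration only.
kpis = online_retail_kpis(
    visits=10_000,
    checkouts_started=500,
    orders=400,
    items_sold=1_000,
    revenue=30_000.0,
    items_returned=50,
)
for name, value in kpis.items():
    print(f"{name}: {value:.3f}")
```

Tracking these values per period, and alongside site or inventory changes as the note suggests, is what turns them into decision inputs rather than isolated numbers.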
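  6. The poor performance of US high school and college students: the high school dropout rate is as high as 30% (on average one high school student drops out every 26 seconds), 33% of college students need to retake courses, and 46% of college students fail to graduate on time.
  7. Baidu's big data department searched data on 37,000 matches played by 987 teams worldwide over the past five years, covering 19,972 players and 112 million related data points. The data came almost entirely from the internet, and a machine learning model designed by search experts was then used to aggregate and analyze it and produce the predictions. For the 16 knockout matches of this World Cup, prediction accuracy reached 100%. In the group stage, which saw frequent upsets this year, Baidu's accuracy in predicting match results reached 58.33%, higher than the 56.25% achieved jointly by Microsoft's voice assistant Cortana and Bing search. But the 100% accuracy only covers picking the winner: Germany's 7:1 result against Brazil was far from Baidu's prediction that Germany would beat Brazil by a slim margin (51% to 49%).
  8. Technology Trigger: At this stage, with heavy and excessive media coverage and irrational hype, the product's name is everywhere; but as the technology's flaws, problems, and limits emerge, failures outnumber successes. Example: the irrational surge of .com companies between 1998 and 2000.
     Peak of Inflated Expectations: Excessive early public attention produces a series of success stories, along with, of course, many failures. Some companies take remedial action in response to failure, but most do nothing.
     Trough of Disillusionment: Technologies that survive the earlier stages undergo solid, focused experimentation from many directions; their scope and limits come to be understood objectively and practically, and successful, viable business models gradually grow.
     Slope of Enlightenment: At this stage a new technology is born and attracts intense attention from the mainstream media and industry. Example: the Internet and the Web in 1996.
     Plateau of Productivity: At this stage the benefits and potential of the new technology are actually accepted by the market, and the tools and methodologies that support the business model in practice have evolved over several generations into a very mature state.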
  9. From : SAS 2014, “Five big data challenges And how to overcome them with visual analytics” http://www.sas.com/resources/asset/five-big-data-challenges-article.pdf
  10. In the past the emphasis was that enough quantity would produce quality; in the future the focus needs to shift more toward value. http://www.cmswire.com/cms/big-data/bigger-better-faster-stronger-the-future-of-big-data-027026.php
Better Big Data
To make big data better, we need to stop talking about how the quality of data matters less in a big data world. If quantity and repetition determined the value of data, we would probably assume that every Twitter utterance by Kim Kardashian and Justin Bieber would be more meaningful than the combined works of Shakespeare. Although some big data scientist of the future may look back at the 21st century and determine that this is the case, this finding would only prove that we as a culture had never solved the true challenges of big data.
We are only starting to get to a point where we are truly able to focus on the quality of big data. Wikipedia has over 70,000 active contributors to clean up its big data and to keep the environment clean over time. As an open community, Wikipedia has become the standard of showing how the quality and improvement of big data can actually occur. This evolution is still only at its starting point.
Pure programmatic automation efforts to make data "better" currently lack the nuance and contextual knowledge to result in improved recommendations. In truth, the vast majority of enterprise data is typically siloed or otherwise inaccessible to the employees, partners and customers who would actually be able to correct the problem. And we are only starting to see the launch of self-service and automated data quality tools that will give line-of-business employees the power to fix their own data with startup software from Paxata, Trifacta, Tamr, and the efforts of larger vendors such as IBM's Watson Analytics and Informatica's Springbok. Until we put the power of data quality into the hands of the masses, big data will struggle to become better.
Stronger Big Data Another key issue with big data — especially as it continues to outstrip the volumes of traditional data solutions — is the challenge of maintaining its purity and context. This challenge ranges from the high level challenges of business continuity and disaster recovery to the most granular challenges of data corruption. It's handled by a combination of data scanning, file detection, data replication, data integration and data recovery. But in between all of these areas are gaps that prevent big data from being as strong and resilient as it needs to be. Regardless of volume, velocity and variety, big data must be effectively stored, transferred, transformed and analyzed without threatening the original data. This means that companies must figure out how to bring their storage, transfer, recovery and scrubbing activities together into an integrated big data resiliency department. One of the biggest challenges to these efforts is to synchronize internal enterprise data scrubbing and replication efforts with similar efforts conducted by managed cloud service vendors. Although the techniques, technologies and efforts may be similar in nature, even simple operational challenges such as matching the frequency and performance metrics across a hybrid environment can be difficult to manage. But as we think about the strength of big data, it is increasingly important to bridge the gaps between firmware error detection, data integrity, data scrubbing, data replication and data management.