Innovation Opportunities and Key Challenges of Big and Open Data - Vincent S. Tseng (曾新穆)

Published in: Data & Analytics

  1. 1. Innovation Opportunities and Key Challenges of Big and Open Data. Vincent S. Tseng (曾新穆), Department of Computer Science, National Chiao Tung University, Taiwan
  2. 2. Starting with Some Innovative Applications
  3. 3. Google Flu Trends. J. Ginsberg, et al., "Detecting influenza epidemics using search engine query data," Nature, February 2009. Link: www.google.com/flutrends
  4. 4. Application in Movie Industry
      • The film The Avengers (復仇者聯盟): production cost of US$200 million
      • How can the studio learn what interests audiences and how they react?
      • How should it set the best marketing strategy?
  5. 5. Application in Movie Industry (cont.)
      • Use Big Data Analytics to monitor and analyze social media reactions to the movie trailer: 1.1 billion tweets/min, 5.7 million blogs/min, 3.5 million messages/min
      • Extract key information, analyze topics, and determine users' intent → summarize users' opinions and ratings of the trailer
      • The studio adjusts its marketing strategy according to the analysis results
      • Box office of The Avengers: after its release in May 2012, the US domestic opening-week gross reached US$200 million (equal to the cost), the highest opening week in US film history
      • The total 2012 gross reached US$1.5 billion, ranking third in worldwide box office history, behind only Avatar and Titanic
  6. 6. Architecture for Big Data Analytics [Figure labels: High-Performance Computing Platform (Cloud, Stream, In-Memory, …); input data (structured, unstructured); Data Access; Data Preparation components; Data Modeling components (data mining, text mining, machine learning, statistical learning); models/rules (clusters, association rules, predictive models, interesting patterns); rules retrieval and prediction components; Deploy; Presentation/Applications (reports, applications module)]
  7. 7. Tackling Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  8. 8. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  9. 9.
  10. 10. Netflix Overview
  11. 11. Business Model
  12. 12. Personalized Recommendation System
  13. 13. Personalized Recommendation System (cont.)
      • Recommendation system & filtering system
      • Uses Big Data Analytics to analyze customer preferences
      • Offers less popular titles to balance and satisfy customer demand; less popular titles account for 70% of rentals
      • When a little-known film recommended to you turns out to be excellent, the feeling is incomparable
      • Three quarters of the recommended titles are rated higher than the newest releases; this is the real value of the recommendation system
      • The world's largest movie rating database, far beyond the service value competitors can offer
  14. 14. Big Data in Netflix
      • 62M+ subscribers in over 50 countries
      • 4M ratings/day
      • 3M searches/day
      • 30M+ plays/day
      • Streaming hours: 2B hours in Q1/2012; 10B hours in Q1/2015
  15. 15. Netflix Prize
      • Grand Prize: $1M USD for a 10% improvement in prediction accuracy
      • Progress Prize: $50,000 USD every year
      • Running since Oct. 2, 2006; ends Oct. 2, 2011, or when some team reaches the 10% goal
      • (Ref: Netflix 2012)
  16. 16. Recommendation Problem: Collaborative Filtering-based Methods
      User-based Collaborative Filtering:
                itm1  itm2  itm3  itm4  itm5
        Andre     ?     1     1     4     5
        Ben       1     2     0     2     0
        Juice     3     1     2     4     5
        David     1     1     0     1     0
      Item-based Collaborative Filtering:
                itm1  itm2  itm3  itm4  itm5
        Andre     ?     1     0     4     5
        Ben       1     2     0     2     0
        Juice     3     1     2     4     5
        David     1     1     0     1     0
      Unifying User-based and Item-based Collaborative Filtering:
                itm1  itm2  itm4  itm3  itm5
        Andre     ?     1     4     0     5
        Ben       1     2     2     0     0
        Juice     3     1     2     4     5
        David     1     1     1     0     0
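The user-based variant above can be sketched as follows. This is a minimal illustration, not the method of any Netflix Prize team: it assumes cosine similarity between users, a similarity-weighted average as the predictor, and it treats the 0 entries of the slide's matrix as ordinary ratings. The helper names `cosine` and `predict` are hypothetical.

```python
from math import sqrt

# Ratings from the user-based example on the slide; None marks the rating to predict.
ratings = {
    "Andre": [None, 1, 1, 4, 5],
    "Ben":   [1, 2, 0, 2, 0],
    "Juice": [3, 1, 2, 4, 5],
    "David": [1, 1, 0, 1, 0],
}

def cosine(u, v, skip):
    """Cosine similarity over every item except the one being predicted."""
    pairs = [(a, b) for i, (a, b) in enumerate(zip(u, v)) if i != skip]
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(target, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for name, row in ratings.items():
        if name == target:
            continue
        sim = cosine(ratings[target], row, skip=item)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den else 0.0

# Predicted rating of Andre for itm1 (index 0).
prediction = predict("Andre", 0)
```

Under these assumptions Juice is by far the most similar user to Andre, so the prediction is pulled toward Juice's rating of itm1.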
  17. 17. Netflix Analytics Work
      • The dataset consists of 100M+ training entries
      • Each training entry is a quadruplet <user, movie, date, grade>, each element an integer
      • The qualifying dataset consists of 2.8M entries of the form <user, movie, date>, without grades
      • Error measure: RMSE (root mean square error)
  18. 18. RMSE Scores
      • 0.8563 (10%): Grand Prize
      • 0.8643 (9.15%): Leader
      • 0.8667 (8.9%): Current progress
      • 0.8712 (8.43%): Progress Prize Winner 2007
      • 0.9514 (0%): Netflix Cinematch
      • 1.0540 (-10.78%): Movie Average
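The percentages on the RMSE slide are relative improvements over the Cinematch baseline of 0.9514. A short sketch of both calculations follows; the function names and the toy ratings are illustrative assumptions, while the scores are the ones listed on the slide.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean square error between two equal-length rating lists."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement(score, baseline=0.9514):
    """Relative improvement over the Cinematch baseline, in percent."""
    return (baseline - score) / baseline * 100.0

# Toy example with made-up predictions and true grades.
toy_rmse = rmse([3.5, 4.0, 2.0], [4, 4, 1])

# Reproduce the percentages shown next to each score on the slide.
for label, score in [("Grand Prize", 0.8563), ("Progress Prize 2007", 0.8712),
                     ("Movie Average", 1.0540)]:
    print(f"{label}: {improvement(score):.2f}%")
```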
  19. 19. 2009 Grand Prize Winner: BellKor's Pragmatic Chaos
  20. 20. Challenges
      • Data sparsity problem: highly sparse data and the cold start problem mean that traditional approaches like CF are not feasible → specialized methods are needed
      • Netflix Prize winner: BellKor's Pragmatic Chaos
      • Gap between complex models and deployment: the winner's solution was a complex composition of hundreds/thousands of learned models → hard to deploy in real applications
      • Similar scenarios exist in many big data applications, and effective solutions are desired!
  21. 21. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  22. 22. Is bigger always better? The veracity issue
  23. 23. Google Flu Trends. J. Ginsberg, et al., "Detecting influenza epidemics using search engine query data," Nature, February 2009. Link: www.google.com/flutrends
  24. 24. Google Flu Trends -- Idea
      • Certain web search terms are good indicators of flu activity.
      • Google Flu Trends uses aggregated search data on flu indicators.
      • It estimates current flu activity around the world in real time.
      • For example, Google Flu Trends detects increased flu activity two weeks before the CDC does.
      • *CDC: Centers for Disease Control and Prevention
  25. 25. Google Flu Trends -- Model
      • Data: all search queries in Google from 2003 to 2008, several hundred billion individual searches in the United States
      • Keep track of only the 50 million most common queries, with a weekly count for each query
      • Also keep counts of each query by geographic region (requires geo-location from IP addresses: >95% accurate)
      • So counts for 50 million queries x 170 weeks x 9 regions
      • Target variable to be predicted, for each week and each region: I(t) = percentage of physician visits that are ILI (influenza-like illness), as compiled by CDC
      • Input variable (query selection, then constructing the ILI-related query fraction): Q(t) = sum of the top n highest-correlated queries / total number of queries that week
      • "Model learning" (logistic regression): $\log\frac{I(t)}{1 - I(t)} = \beta \log\frac{Q(t)}{1 - Q(t)} + \text{noise}$
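The model-learning step can be sketched as an ordinary least-squares line fitted on the log-odds scale. The sketch below uses the form with an intercept, logit(I) = b0 + b1 · logit(Q) + noise, as in Ginsberg et al.; the weekly values are invented for illustration and the function names are hypothetical.

```python
from math import exp, log

def logit(p):
    """Log-odds of a fraction p in (0, 1)."""
    return log(p / (1.0 - p))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical weekly data: ILI-related query fraction Q(t) and ILI visit fraction I(t).
query_fraction = [0.0010, 0.0020, 0.0040, 0.0030, 0.0015]
ili_fraction = [0.010, 0.018, 0.035, 0.027, 0.014]

b0, b1 = fit_line([logit(q) for q in query_fraction],
                  [logit(i) for i in ili_fraction])

def predict_ili(q):
    """Map a query fraction back to an estimated ILI visit fraction."""
    z = b0 + b1 * logit(q)
    return 1.0 / (1.0 + exp(-z))
```

Fitting on log-odds keeps every prediction strictly between 0 and 1, which suits a target that is a percentage of physician visits.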
  26. 26. The Parable of Google Flu: Traps in Big Data Analysis (Science, Mar. 2014)
  27. 27. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  28. 28. Deep Understanding of Key Features
  29. 29. A large-scale research initiative aimed at:
      • Innovations around smartphone-based research
      • Collecting smartphone data in everyday-life conditions
      • Community-based evaluation of related mobile data analysis methodologies
      • Data source: Lausanne Data Collection Campaign
  30. 30. User Profile/Behavior Modeling and Prediction. Data types: personal information, media files, device information, process, calendar, applications, social information, accelerometer, system information, location information, call log, contacts, Bluetooth, GSM, WLAN, sequence of place visits
  31. 31. MDC 2012 Tracks
      • Main goal: user profile/behavior modeling and prediction
      • Dedicated track, demographic attribute prediction: predict the gender, age group, marital status, job type, etc. of a user
      • Dedicated track, semantic place prediction: predict the semantic meaning of a user's visited places
      • Dedicated track, next place prediction: predict the next destination of a user
  32. 32. Demographic Attribute Prediction. One of the items: prediction of gender
  33. 33.
  34. 34. Modeling Flow
  35. 35. Demographic Attribute Prediction
      • Lots of features could be extracted from the data
      • 10,000+ features used by the winning team!
      • High accuracy achieved: 96%
      • [Figure labels: location features, media features, sensor features, …]
  36. 36. Very high dimensional complexity: a feasibility problem in real applications! Is there some key/dominating feature? [Figure labels: location features, media features, sensor features, …]
  37. 37. Demographic Attribute Prediction (cont.)
      • The accelerometer is actually a key/dominating feature!
      • It alone supports an accuracy of around 95%
      • Underlying reasoning?
  38. 38. Very different behavior between males and females!
  39. 39. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  40. 40. Timeliness in Big Data Analytics (Source: IBM white paper)
  41. 41. One Solution: Data Sampling - Bias on Data Samples
      • Twitter provides two main outlets for researchers to access tweets in real time: the Streaming API (~1% of all public tweets, free) and the Firehose (100% of all public tweets, costly)
      • Streaming API data is often used by researchers to validate hypotheses.
      • How well does the sampled Streaming API data measure the true activity on Twitter?
  42. 42. Bias on Data Samples (cont.) Source: [Huan Liu et al., AAAI ICWSM 2013]
  43. 43. National Health Insurance Data
  44. 44. National Health Insurance Research Database in Taiwan
      • National Health Insurance (NHI): established on March 1, 1995; serves 99.2% of the Taiwanese population (20M+); covers 92.62% of medical institutions
      • Longitudinal Health Insurance Database (LHID): sampled from the NHIRD; includes the health records of 951,044 people; 1997 – now
      • Strongly representative of Taiwan: every living region; long time interval of 15+ years
      • Reference: National Health Insurance, http://www.nhi.gov.tw
  45. 45. Linking with More Heterogeneous Datasets [Figure labels: NHI, CR, COD, BR; environmental monitoring data; lab data & patient-reported outcomes; sensor-based biomarker monitoring data; cloud computing; smart health risk alert]
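  46. 46. NHI Data Sampling Method
      • Data content: 1 million people drawn at random from those insured in 2010, as recorded in the 2010 enrollment file
      • Sampling population: the 2010 enrollment file provided by the National Health Insurance Administration, consolidated to individuals by national ID number plus date of birth plus sex, yields records for 27,378,403 people, which serve as the master file
      • Sampling method: a random number generator produces at least 1 million random values (1,074,263 were actually obtained); the serial numbers matching 1 million of these random values are taken to randomly select the required sample of insured persons
      • The random values were generated with Oracle's DBMS_RANDOM package
      • Source: National Health Insurance Research Database, http://nhird.nhri.org.tw/date_cohort.htm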
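The serial-number sampling described above can be sketched with Python's standard `random` module. This is only an illustration of the idea, not the Oracle DBMS_RANDOM procedure used for the database: the function name `draw_serials`, the seed, and the scaled-down sizes are all assumptions.

```python
import random

def draw_serials(population_size, sample_size, seed=2010):
    """Draw random serial numbers until `sample_size` distinct ones are collected.

    Returns the sorted serial numbers and the number of random values drawn,
    which can exceed `sample_size` because repeated values are discarded.
    """
    rng = random.Random(seed)
    chosen = set()
    draws = 0
    while len(chosen) < sample_size:
        chosen.add(rng.randint(1, population_size))  # serial numbers are 1-based
        draws += 1
    return sorted(chosen), draws

# Scaled-down run (about 1/1000 of the 27,378,403 and 1,000,000 on the slide).
serials, draws = draw_serials(population_size=27_378, sample_size=1_000)
```

Records whose serial number appears in `serials` would form the sample.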
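  47. 47. NHI Data Sampling Method (cont.)
      • How the 1-million sample is validated against the sampling population (the entire population):
      • Compare the distributions of age, sex, and annual number of births, and the average insured amount, between the 1-million sample and the sampling population to check for differences
      • Also compare with the figures published by the Ministry of the Interior
      • Use chi-square analysis to assess how well the 1-million sample represents the sampling population
      • All results are within the 5% significance level
      • Source: National Health Insurance Research Database, http://nhird.nhri.org.tw/date_cohort.htm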
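A chi-square goodness-of-fit check of this kind can be sketched as below. The counts and population shares are invented for illustration; only the test statistic and the 5% critical value for one degree of freedom (3.841) are standard.

```python
def chi_square(observed_counts, expected_shares):
    """Pearson chi-square statistic of observed counts against expected shares."""
    total = sum(observed_counts)
    return sum((obs - total * share) ** 2 / (total * share)
               for obs, share in zip(observed_counts, expected_shares))

# Hypothetical sex distribution: population shares vs. counts in a 1-million sample.
population_shares = [0.50, 0.50]
sample_counts = [500_600, 499_400]

statistic = chi_square(sample_counts, population_shares)

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5PCT_DF1 = 3.841
no_significant_difference = statistic < CRITICAL_5PCT_DF1
```

A statistic below the critical value means the sample's distribution does not differ significantly from the population's at the 5% level; the same check would be repeated for age, births per year, and so on, each with its own degrees of freedom.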
  48. 48. Disease Factor Analysis: Linked data is biased! [Figure labels: station, date, daily number of visits for disease X; atmospheric environment data (monitoring stations); air pollution data (monitoring stations); uses the LHID2000 one-million sample file]
  49. 49. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  50. 50. Mining User Preference for POI Recommendation
  51. 51. Goal: How to do POI recommendation by utilizing users' social network logs (e.g., check-ins)?
  52. 52. Urban Point-of-Interest Recommendation by Mining User Check-in Behaviors. Josh Jia-Ching Ying, Eric Hsueh-Chan Lu, Wen-Ning Kuo, and Vincent S. Tseng. 2012 ACM SIGKDD Int'l Workshop on Urban Computing (UrbComp 2012)
  53. 53. Proposed Method – UPOI-Mine [Figure labels: LBSN dataset; feature extraction (Social Factor, Individual Preference, POI Popularity); User-POI graph construction; relevance learning; User-POI relevance matrix; POI recommendation (user request, top-k nearest POI selection, top-k nearest POIs, POI ranking, POI recommending list)]
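  54. 54. Social Factor (SF)
      • Relation weight: $\mathrm{Relation}_{i,k} = w \cdot \mathrm{CheckSim}_{i,k} + (1 - w) \cdot \mathrm{DisSim}_{i,k}$
      • $\mathrm{SF}(\mathrm{user}_i, \mathrm{POI}_j) = \sum_{k=1}^{|F|} \left[ \mathrm{Relation}_{i,k} \times \mathrm{Interest}_{k,j} \right]$
      • $\mathrm{Interest}_{k,j} = \dfrac{\mathrm{checkin}_{k,j}}{\sum_{s=1}^{|S|} \mathrm{checkin}_{k,s}}$
      • F: friends of user i; S: the set of POIs; $\mathrm{checkin}_{k,*}$: check-ins of user k at POI *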
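The social factor can be computed directly from friends' check-in counts. The sketch below assumes the relation weights between the target user and each friend have already been computed; the user names, POI names, and numbers are made up.

```python
def interest(friend_checkins):
    """Share of a friend's check-ins at each POI (Interest_{k,j})."""
    total = sum(friend_checkins.values())
    if total == 0:
        return {}
    return {poi: count / total for poi, count in friend_checkins.items()}

def social_factor(relation, checkins_by_friend, poi):
    """Sum over friends k of Relation_{i,k} * Interest_{k,j} for one POI j."""
    return sum(weight * interest(checkins_by_friend[friend]).get(poi, 0.0)
               for friend, weight in relation.items())

# Hypothetical inputs: relation weights of user i to two friends, and their check-ins.
relation = {"ben": 0.8, "dana": 0.2}
checkins_by_friend = {
    "ben": {"poi_1": 3, "poi_2": 1},
    "dana": {"poi_1": 1, "poi_3": 1},
}

sf_poi_1 = social_factor(relation, checkins_by_friend, "poi_1")  # 0.8*0.75 + 0.2*0.5
sf_poi_3 = social_factor(relation, checkins_by_friend, "poi_3")  # 0.2*0.5
```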
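  55. 55. Individual Preference (IP)
      • Individual preference combines a category preference $\mathrm{CPref}_{i,c}$ and a highlight preference $\mathrm{HPref}_{i,h}$
      • $\mathrm{IP}(\mathrm{user}_i, \mathrm{POI}_j)$ is a weighted sum, with the two weights summing to 1, of a category term $\sum_{c \in C} \mathrm{CPref}_{i,c} \cdot I_{ctg(c)}(\mathrm{POI}_j)$ and a highlight term $\sum_{h \in H} \mathrm{HPref}_{i,h} \cdot \dfrac{\mathrm{HCount}_{h,j}}{\sum_{g \in H} \mathrm{HCount}_{g,j}}$
      • $I_{ctg(c)}(\mathrm{POI}_j)$ is an indicator function: 1 if $ctg(\mathrm{POI}_j) = c$, 0 otherwise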
  56. 56. POI Popularity (PP)
      • Relative popularity of a POI, normalized by category
      • $\mathrm{RP}_j = \dfrac{\mathrm{checkins}_j}{\sum_{k \in CS_j} \mathrm{checkins}_k}$, where $CS_j$ is the set of POIs in the same category as $\mathrm{POI}_j$
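Relative popularity is a per-category normalization of check-in counts, which a short sketch makes concrete. The POI names, categories, and counts below are invented.

```python
from collections import defaultdict

def relative_popularity(checkins, category):
    """Check-ins of each POI divided by the total check-ins of its category."""
    category_totals = defaultdict(int)
    for poi, count in checkins.items():
        category_totals[category[poi]] += count
    return {
        poi: (count / category_totals[category[poi]]
              if category_totals[category[poi]] else 0.0)
        for poi, count in checkins.items()
    }

# Hypothetical check-in counts and categories.
checkins = {"cafe_a": 30, "cafe_b": 10, "museum_a": 5}
category = {"cafe_a": "cafe", "cafe_b": "cafe", "museum_a": "museum"}

rp = relative_popularity(checkins, category)
```

Normalizing within a category keeps a busy category from drowning out POIs in quieter ones: the only museum scores 1.0 despite having the fewest check-ins.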
  57. 57. Relevance Estimation
      • Target: to estimate the relevance of each user-POI pair, we use these features to learn a regression-tree model.
        User ID  POI ID  SF     PP    IP     Relevance
        1        A       0.2    0.1   0.001  3
        1        B       0.05   0.2   0.1    5
        1        C       0.004  0.1   0.9    1
        …        …       …      …     …      …
        N        D       0.5    0.15  0.06   2
  58. 58. Experimental Evaluation
      • Real dataset crawled from Gowalla in the New York City area
      • 1,964,919 POIs
      • 18,159 people
      • 5,341,191 check-ins
      • 392,246 friendship links
  59. 59. Comparisons with Other Recommenders
  60. 60. Better way for modeling? UPOI-Walk. In ACM Transactions on Intelligent Systems and Technology, 2014
  61. 61. Motivation
      • The existing models could not deal with such heterogeneous features well
      • The existing models try to combine all features into one measure for building a single model → Bias!
      • [Figure labels: LBSN dataset; feature extraction (Social Factor, Individual Preference, POI Popularity); User-POI graph construction; User-POI graphs; relevance learning with a HITS-based random walk; User-POI relevance matrix; POI recommendation (user request, top-k nearest POI selection, top-k nearest POIs, POI ranking, POI recommending list)]
  62. 62. Random Walk Model
  63. 63. HITS-based Random Walk
      • X. Cao, et al., "Mining significant semantic locations from GPS data," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, September 2010
      • [Figure: example graph with weighted edges]
      • Given an m × n hits value matrix M, the walk alternates two damped updates: the POI score vector $x_{POI}^{k+1}$ is computed from the user score vector $x_{user}^{k}$ through $M_{col}^{T}$, and $x_{user}^{k+1}$ is then computed from $x_{POI}^{k+1}$ through $M_{row}$
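A generic HITS-style alternating update with damping can be sketched as follows. This is not the exact formulation of UPOI-Walk or of Cao et al.: the column and row normalizations, the uniform damping term, the per-step renormalization, the value of `alpha`, and the example matrix are all assumptions made for illustration.

```python
def normalize(vector):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(vector)
    return [x / total for x in vector] if total else vector

def hits_random_walk(M, alpha=0.85, max_iter=200, tol=1e-12):
    """Alternate POI and user score updates on an m x n user-by-POI matrix M."""
    m, n = len(M), len(M[0])
    col_sums = [sum(M[i][j] for i in range(m)) for j in range(n)]
    row_sums = [sum(M[i]) for i in range(m)]
    # Assumed meaning of M_col and M_row: column- and row-normalized copies of M.
    m_col = [[M[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
             for i in range(m)]
    m_row = [[M[i][j] / row_sums[i] if row_sums[i] else 0.0 for j in range(n)]
             for i in range(m)]
    user = [1.0 / m] * m
    poi = [1.0 / n] * n
    for _ in range(max_iter):
        # POI scores from user scores via M_col^T, mixed with a uniform term.
        new_poi = normalize([
            alpha * sum(m_col[i][j] * user[i] for i in range(m)) + (1 - alpha) / n
            for j in range(n)])
        # User scores from the fresh POI scores via M_row, mixed the same way.
        new_user = normalize([
            alpha * sum(m_row[i][j] * new_poi[j] for j in range(n)) + (1 - alpha) / m
            for i in range(m)])
        delta = (sum(abs(a - b) for a, b in zip(new_poi, poi))
                 + sum(abs(a - b) for a, b in zip(new_user, user)))
        user, poi = new_user, new_poi
        if delta < tol:
            break
    return user, poi

# Hypothetical hits value matrix: 3 users (rows) by 3 POIs (columns).
hits_matrix = [[3, 0, 1],
               [1, 2, 0],
               [0, 4, 1]]
user_scores, poi_scores = hits_random_walk(hits_matrix)
```

The dynamic variant would pick a different matrix from the network set at each half-step instead of reusing one M.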
  64. 64. Dynamic HITS-Based Random Walk
      • Network set = {M, N, X, Y, Z, …}
      • Randomly select hits value matrices from the network set and apply the alternating POI/user updates with a different matrix at each step: $M_{col}^{T}$, then $N_{row}$, then $X_{col}^{T}$, then $Y_{row}$, then $Z_{col}^{T}$, …, until converged
  65. 65. Comparison with Existing Recommenders - NDCG
  66. 66. Beautiful algorithms still matter a lot for Big Data Analytics!
  67. 67. Some Key Challenges
      • Data Preprocessing Phase: data quality problems (noise, incompleteness, sparsity); veracity issue (is bigger always better?)
      • Data Understanding Phase: key feature discovery (finding the needle in a haystack)
      • Learning and Modeling Phase: timeliness vs. precision (issues for data sampling); need for more sophisticated methodologies
      • Post-processing Phase
  68. 68. Medical Cloud Project
  69. 69. National Health Insurance Data Value-Added Project
  70. 70. Early Prediction of Diseases. Huizinga, T. W. J., & van der Helm-van Mil, A. H. M. (2007). Prediction and prevention of rheumatoid arthritis. Revista Colombiana de Reumatología, 14(2), 106-114. [Figure labels: Early RA, 12 months; RA diagnosis, 18 months; very early detection, ~X years]
  71. 71. Analytics Framework [Figure labels: offline: raw data, target data, preprocessed data, data mining techniques, discovered rules, classifier; online: morbidity risk prediction system, health records, potential patient, doctor/hospital, predicted risk]
  72. 72. Rules Produced: Too many rules! Postprocessing is essential!
  73. 73. Post-Processing – Rules Filtering
      • Lift > 1: 11,004 rules
      • Lift = 1: 357 rules
      • Lift < 1: 7,543 rules
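Filtering by lift keeps only the rules whose antecedent and consequent co-occur more often than independence would predict. The sketch below uses the standard definition lift(A → B) = P(A and B) / (P(A) · P(B)); the rule names and counts are invented and do not come from the study.

```python
def lift(n_total, n_antecedent, n_consequent, n_both):
    """Lift of a rule A -> B from raw counts: P(A and B) / (P(A) * P(B))."""
    return (n_both * n_total) / (n_antecedent * n_consequent)

# Hypothetical rules: (name, total records, records with A, records with B, records with both).
rules = [
    ("rule_1", 1000, 100, 50, 10),
    ("rule_2", 1000, 200, 100, 20),
    ("rule_3", 1000, 300, 100, 15),
]

lifts = {name: lift(total, a, b, both) for name, total, a, b, both in rules}

# Keep only positively associated rules, as in the "Lift > 1" group on the slide.
kept = [name for name, value in lifts.items() if value > 1]
```

A lift of exactly 1 means A and B look independent, and a lift below 1 means they co-occur less often than chance, which is why only the first group is carried forward.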
  74. 74. Postprocessing: Literature Search (PubMed) [Figure: bar chart of PubMed hit counts per condition, ranging from 0 to 4,557; conditions shown include systemic lupus erythematosus, diabetes, osteoporosis, asthma, anxiety, peptic ulcer, cataract, sicca syndrome, dyspepsia, allergic rhinitis, bronchitis, conjunctivitis, cervical spondylosis, spinal stenosis, vaginitis, pterygium, and acute laryngopharyngitis]
  75. 75. A More Complete Framework (in PLOS One 2015)
  76. 76. After Postprocessing: Interesting Rules
  77. 77. How to summarize, validate, and interpret the discovered results is an important last mile for Big Data Analytics!
  78. 78. Concluding Remarks: Grand Opportunities
      • "Data is King": the age of data monetization
      • Data vs. ideas vs. technologies: from data to idea; from idea to data; utilization of the right technologies
      • Visioning: those who own valuable data can be king; those who own no data but have innovative ideas can also be king
      • Innovative Ideas + Right Tech on Valued Data => Smart King
  79. 79. Grand Challenges, Big Opportunities!
  80. 80. Thanks for your attention
