
Examples of working with streaming data


  1. Examples of Working with Streaming Data  Yi-Shin Chen, Institute of Information Systems and Applications, Department of Computer Science, National Tsing Hua University, yishin@gmail.com
  2. Hello 陳宜欣 (Yi-Shin Chen)  Currently: associate professor at NTHU CS; director of IDEA Lab  Education: Ph.D. in Computer Science, USC, USA; M.B.A. in Information Management, NCU, TW; B.B.A. in Information Management, NCU, TW  Courses: Introduction to Database Systems; Advanced Database Systems; Data Mining: Concepts, Techniques, and Applications
  3. Research Focus from 2000 (diagram: storage, index, optimization, query, mining, DB)
  4. Streaming Data: What should we know?
  5. Streaming Data  Continuous flow: e.g., infinite length, so it is impractical to store and use all historical data  Concept drift: it is not wise to use all historical data  Examples: stock volume, sensor data, social streams
  6. Continuous Queries (diagram: the acquisition process transforms the raw stream into a stream DB; crowd wisdom yields rules/patterns; the continuous query process continuously provides feedback)  Three major approaches for continuous queries: fast on-line classification/clustering; sliding window; range aggregation
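The sliding-window style of continuous query can be sketched as a small generator over a potentially infinite stream; the window size and the average aggregate below are illustrative choices, not taken from the slides:

```python
from collections import deque

def sliding_window_avg(stream, window_size):
    """Yield the running average over the last `window_size` items of a
    (potentially infinite) stream -- a sliding-window range aggregation."""
    window = deque(maxlen=window_size)  # old items fall off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Usage: aggregate a toy stock-volume stream with a window of 3.
volumes = [10, 20, 30, 40, 50]
print(list(sliding_window_avg(volumes, 3)))
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the generator never materializes the full history, it matches the "impractical to store all historical data" constraint above.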
  7. Example 1: Auto-identify the Influence of Events Based on Stock News
  8. Framework of Off-line Training Module (diagram: acquisition process → crowd wisdom → rules/patterns)
  9. Alignment  Industries: finance, textile, car, ……, each with its companies and related words  belong_n = [P_finance, P_textile, ……, P_car]  Example (translated from Traditional Chinese): "The Luxgen Neora concept car, which first appeared at the Shanghai Auto Show in April 2011, is the first concept car launched by the domestic brand Luxgen since its founding……" → belong_n = [0, 0, ……, 3], i.e., P_finance = 0, P_textile = 0, P_car = 3
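The alignment step can be sketched as a per-industry word count; the `INDUSTRY_WORDS` entries below are invented placeholders, not the actual collected company/related-word lists:

```python
# Hypothetical related-word lists per industry; the real lists come from
# the collected company and related-word data described on the slide.
INDUSTRY_WORDS = {
    "finance": {"bank", "loan", "interest"},
    "textile": {"fabric", "cotton", "yarn"},
    "car": {"Luxgen", "concept", "engine"},
}

def belong_vector(tokens):
    """Build belong_n = [P_finance, P_textile, ..., P_car]: for each
    industry, count how many tokens appear in its related-word list."""
    return {industry: sum(1 for t in tokens if t in words)
            for industry, words in INDUSTRY_WORDS.items()}

# A token stream loosely mirroring the Luxgen Neora example.
tokens = ["Luxgen", "concept", "engine", "debut"]
print(belong_vector(tokens))  # {'finance': 0, 'textile': 0, 'car': 3}
```

The presenter notes then pick the significant industries from this vector via boxplot outlier detection rather than a plain argmax.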
  10. Itemset Production  Grouped keyword pairs (translated): Japan+earthquake, Japan+relief; Japan+earthquake, Japan+flood; Japan+earthquake, Japan+impact; Japan+earthquake, Japan+estimate; Japan+earthquake, Japan+damage; Japan+purchase, Japan+travel  The confidence of Japan+earthquake: the number of transactions in which Japan+earthquake appears is u_s; the number of transactions in which Japan appears is n_p; confidence = u_s / n_p = 5/6
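The confidence computation can be sketched directly over a toy transaction set (keyword pairs loosely translated from the slide's Traditional Chinese examples):

```python
def confidence(transactions, itemset, antecedent):
    """confidence = (# transactions containing the full itemset)
                  / (# transactions containing the antecedent)."""
    u_s = sum(1 for t in transactions if itemset <= t)     # subset test
    n_p = sum(1 for t in transactions if antecedent <= t)
    return u_s / n_p

# Toy transactions mirroring the slide: "Japan+earthquake" appears in
# 5 of the 6 transactions that contain "Japan".
transactions = [
    {"Japan", "earthquake", "relief"},
    {"Japan", "earthquake", "flood"},
    {"Japan", "earthquake", "impact"},
    {"Japan", "earthquake", "estimate"},
    {"Japan", "earthquake", "damage"},
    {"Japan", "purchase", "travel"},
]
print(round(confidence(transactions, {"Japan", "earthquake"}, {"Japan"}), 3))
# 0.833
```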
  11. Representative Itemset Selection  Select high-confidence itemsets as candidate representative itemsets  weight = x * tfidf_1 + y * tfidf_2 + z * confidence  Example tf-idf values (translated terms): Japan 0.22, earthquake 0.25, estimate 0.03, nuclear 0.18, leak 0.2, crisis 0.10, occur 0.001  Itemset confidences: Japan+earthquake 0.833, Japan+estimate 1, nuclear+leak 0.667, crisis+occur 0.667  Resulting concept: Japan, earthquake, nuclear, leak
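A minimal sketch of the weighting formula; the `x`, `y`, `z` coefficients below are placeholder values of 1.0, whereas the presenter notes say they are found through automatic training:

```python
def itemset_weight(tfidf1, tfidf2, conf, x=1.0, y=1.0, z=1.0):
    """weight = x * tfidf_1 + y * tfidf_2 + z * confidence.
    x, y, z are tunable coefficients (learned in the original work)."""
    return x * tfidf1 + y * tfidf2 + z * conf

# Score "Japan + earthquake" using the tf-idf values (Japan 0.22,
# earthquake 0.25) and confidence (0.833) shown on the slide.
w = itemset_weight(tfidf1=0.22, tfidf2=0.25, conf=0.833)
print(round(w, 3))  # 1.303
```

Itemsets are then ranked by this weight so that selection rewards both keyword importance (tf-idf) and confidence.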
  12. Concept Verification  By considering: the daily frequency of concept C_j; the concept index CI_j of C_j; a regression model based on price within sliding windows  If the p-value rejects H_0, the concept C_j is considered an influential event
  13. On-line Prediction Module (continuous query process)  Regression prediction: use the most frequent event  Adjusted regression prediction: include other events that are not the most frequent  Pheromone prediction: include the past influence
  14. Experimental Data  Stock data: industry index from TWSE, 2012-01-01 to 2012-05-11  News data: crawled from 13 websites (Yahoo!, udn, Libertytimes, PCHome, etc.), 2012-01-01 to 2012-05-11; more than 150,000 news articles, all in Traditional Chinese
  15. Experimental Setup  Four methods to predict the market: pheromone prediction model; adjusted regression prediction model; regression prediction model; blind test  Prediction policy: fall, rise, or NSM (no significant move)
  16. Performance  Average accuracy of the four methods: Pheromone 0.5784574; Adjusted regression 0.5323214; Regression 0.5134457; Blind test 0.3045479
  17. Performance  Does it work on the whole market? Using events to predict the whole market, by aggregating all industries, caught our attention  Accuracy: Pheromone 0.6315789; Adjusted Regression 0.6896511; Regression 0.5714285
  18. Example 2: An Interactive Conducting System Using a Motion Detector
  19. Motivation  Diversify human-computer interaction technology with multimedia: music education; music experiments; amateur and professional conductors; composers; personal amusement
  20. Devices  Build an interactive conducting system using motion: Microsoft Kinect, with 3D depth sensors
  21. Challenges (figure: the ideal four-beat pattern vs. the four-beat trajectory captured by Kinect)
  22. Conducting Data (Data Streams)  Cartesian coordinates (x, y, z)  30 frames per second at 320x240 resolution → delay of 33 ms (1/30 second)  Human eyes can process 10 to 12 frames per second [2] → delay ≈ 100 ms (1/10 second)  (Figure: sensor direction and the ±X, ±Y, Z axes)
  23. Framework (offline analysis: acquisition process → crowd wisdom → rules/patterns; continuous query process)  Initialize the system with PlayStatus = False. While PlayStatus is false, run start-gesture recognition; when Start is recognized, set PlayStatus = True. While playing: receive conducting data; recognize the beat pattern and whole measure; identify instrument emphasis from the relative height of the hand and its tilt; map Z to volume; adjust volume and tempo according to the instrument emphasis. Run stop-gesture recognition; when Stop is recognized, set PlayStatus = False.
  24. Experiments  Evaluation: beat pattern and measure recognition; volume control and instrument emphasis recognition; response time  Experimental setup: participants were 1 professional and 8 with no experience; 30 minutes of practice
  25. Beat Pattern and Measure Recognition Evaluation (chart)  Professional: recall 0.7826, precision 0.8438  No experience: recall 0.8648, precision 0.8821
  26. Instrument Emphasis  Adjust volume in the correct instrument sections (chart: recall and precision per section are mostly 1; the lowest recall values are 0.9375 and 0.8666, the lowest precision is 0.9286 — the percussion and brass sections at the back of the virtual orchestra)
  27. Example 3: Social Stream Analysis for Location Identification
  28. Goal  Identify the location of a particular Twitter user at a given time, using exclusively the content of his/her tweets
  29. Major Challenges  Twitter challenges: tweets are noisy; extensive use of non-standard vocabulary; bots and spammers  Geo-locational challenges: users might have several associated locations; toponyms; scarce information; false profile information
  30. Framework (diagram: acquisition process → crowd wisdom → rules/patterns → continuous query process)
  31. Experimental Setup  Original dataset: 1.53 M Twitter users and 13 M tweets → 3,314 Twitter users and 2.2 M tweets → 104,054 geo-tagged tweets  Although we collected and processed the data carefully, it still needed to be validated, using local experts: people familiar with the geography of the country  Filtering pipeline: original tweets 329,814 → subject identification 57,153 → location discovery 18,662 → toponym removal 9,093 → timeline sorting 6,928 → final results 2,165
  32. Evaluation  Recruited an international workforce from a crowdsourcing platform, selecting workers with good reputation
  33. General Statistics
  34. Example 4: Social Stream Analysis for Event Identification
  35. Introduction  Analyzing social streams can benefit: emergency control; crowd opinion analysis; detection of unreported events  Motivation: event identification from social streams
  36. Methodology (acquisition process → continuous query process; offline analysis → crowd wisdom → rules/patterns)  Tweets → data preprocessing → keyword selection → event candidate recognition → event candidates; user social structures → evolving social graph analysis → event identification
  37. Methodology – Keyword Selection  Well-noticed criterion: compared to the past, if a word is suddenly mentioned by many users, it is well-noticed  Time frame: a unit of time period  Sliding window: a certain number of past time frames (tf0, tf1, tf2, tf3, tf4 along the time axis)
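One possible reading of the well-noticed criterion, sketched with an assumed burst threshold (the slides name the criterion but do not give the exact test):

```python
from collections import deque

def is_well_noticed(history, current_count, factor=2.0):
    """Flag a keyword as well-noticed when its mention count in the
    current time frame exceeds `factor` times its average over the past
    frames in the sliding window. `factor` is an assumed threshold."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return current_count > factor * baseline

# Per-frame counts of one keyword across tf0..tf4; the sliding
# window keeps the last 4 frames.
window = deque(maxlen=4)
flags = []
for count in [3, 4, 2, 3, 20]:
    flags.append(is_well_noticed(window, count))
    window.append(count)
print(flags)  # [False, False, False, False, True]
```

Only the burst at tf4 (20 mentions against a baseline of ~3) is flagged; steady chatter stays below the threshold.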
  38. Methodology – Event Candidate Recognition  Idea: group one keyword with its most relevant keywords into one event candidate (example cluster: boston, explosion, confirm, prayer, bombing, boston-marathon, threat, iraq, jfk, hospital, victim, afghanistan, bomb, america)
  39. Methodology – Evolving Social Graph Analysis  Information decay: vertex weights and edge weights fade under a decay mechanism  Concept-Based Evolving Graph Sequences (cEGS): a sequence of directed graphs (tf1, tf2, tf3, …) that demonstrates information propagation
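A minimal decay sketch; exponential decay and the 0.5 rate are illustrative assumptions, since the slides only state that vertex and edge weights decay over time frames:

```python
import math

def decayed_weight(initial_weight, frames_elapsed, decay_rate=0.5):
    """Exponentially decay a vertex/edge weight as time frames pass,
    so stale propagation fades out of the evolving graph."""
    return initial_weight * math.exp(-decay_rate * frames_elapsed)

# An edge created with weight 1.0 fades across three time frames.
weights = [round(decayed_weight(1.0, k), 3) for k in range(4)]
print(weights)  # [1.0, 0.607, 0.368, 0.223]
```

Pruning edges once their decayed weight falls below a small threshold keeps each graph in the cEGS sequence focused on recent propagation.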
  40. Experiment  Testing: events identified in November 2013, evaluated by 7 human experts  Average precision: 86.64% (chart: daily precision from Nov 2 to Nov 30)
  41. Example 5: Social Stream Analysis for Mental Disorder Detection
  42. Introduction  18.1% of people in the United States suffer from a mental disorder (*)  Using social networks to research mental disorders  (*) National Institute of Mental Health: http://www.nimh.nih.gov/health/statistics/prevalence/index.shtml
  43. Background  Bipolar disorder: unstable and impulsive emotions; cycling between manic and depressive episodes  Borderline personality disorder: unstable and impulsive emotions; impaired social interactions
  44. Framework (diagram: acquisition process → crowd wisdom → rules/patterns)
  45. Collect Patient Data: support group
  46. Collect Patient Data: followers
  47. Collect Patient Data
  48. Collect Patient Data
  49. Collect Patient Data: wait! A control group is needed
  50. Collect Data from Ordinary People
  51. Collect Data from Ordinary People
  52. Collect Data from Ordinary People
  53. Basic Guidelines  Identify the commonalities and differences between the experimental and control groups  Features: word/pattern frequency; emotion-related data (e.g., flipping rates, occurrence rates); social interaction (e.g., retweets, replies); lifestyle (e.g., online time, staying up late or not); age and gender
  54. Apply Classifiers (Online)  By utilizing the extracted features with various classifiers: neural networks; naïve Bayes and Bayesian belief networks; support vector machines; random forests (continuous query process)
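As an illustration of one listed classifier family, here is a minimal multinomial naïve Bayes over token features — a from-scratch sketch with toy data, not the authors' implementation or feature set:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)               # class counts
        self.word_counts = defaultdict(Counter)    # per-class token counts
        self.vocab = set()
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        def log_score(y):
            total = sum(self.word_counts[y].values())
            s = math.log(self.prior[y] / sum(self.prior.values()))
            for t in tokens:  # Laplace (+1) smoothing for unseen tokens
                s += math.log((self.word_counts[y][t] + 1) /
                              (total + len(self.vocab)))
            return s
        return max(self.classes, key=log_score)

# Toy token features: tweets from hypothetical "patient" vs. "control" users.
docs = [["sad", "awake", "night"], ["sad", "alone"],
        ["happy", "work"], ["work", "coffee"]]
labels = ["patient", "patient", "control", "control"]
clf = NaiveBayes().fit(docs, labels)
print(clf.predict(["sad", "night"]))  # patient
```

In the online setting, each incoming tweet's extracted features would be fed to `predict` as part of the continuous query process.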
  55. Precisions
  56. Possible Continuous Query Results
  57. More in the future… Thank you. Contact me at: yishin@gmail.com

Editor's Notes

  • Since we want to analyze how events influence different industries, the category alignment step classifies news articles by industry.
  • After collecting each industry's related companies and related words, we use this data as the basis for alignment.

    When a news article arrives, we assign it a belong vector; each value in the vector represents the article's fitness for one industry.

    If a word appears in an industry's list, that industry's fitness is incremented by one.

    A larger value in the vector indicates a better fit.

    Our initial strategy was to pick the industry with the largest value as the article's assignment, but this is flawed because an event can affect more than one industry.

    We therefore want to select the relatively significant industries from this vector.

    The method we use here is outlier detection: a boxplot picks out the large-valued outliers as the industries the article is assigned to.
  • We want to find the keyword sets that frequently appear within these groups; these sets are our itemsets.
  • The features of an itemset are as follows.

    A good itemset should not only have high confidence; its keywords should also be relatively important.

    We want an automatic training method to help us find good weights for selecting itemsets.
  • Control variables
    Response variables
  • We have three different verification methods; each makes a different assumption.
  • One purpose of a conducting system is to diversify the ways we interact with multimedia information.
    For example, a conducting system in schools lets student conductors practice on their own.
    Amateur conductors, professional conductors, and composers can all use a conducting system to simulate and explore musical variety.
    Conducting systems entering personal entertainment has also become a trend, as Wii Music shows.
  • So my research goal is to use Microsoft Kinect to build an interactive conducting system, so that conductors need not wear any electronic devices and we avoid the limitations of image processing.
    We chose Microsoft Kinect for its advanced skeleton-tracking technology: its two 3D depth sensors return the depth of objects in the environment and can track the human skeleton.
    It is also cheap, so the resulting system is easier to popularize.
  • In these two figures, the left is the ideal four-beat pattern and the right is the four-beat trajectory read from the Kinect; some points of the Kinect trajectory cannot be recognized because the hand overlaps the body.
    So we cannot find the beat pattern from the x, y, z coordinates alone: our algorithm must recognize different beat patterns and still work when data is missing.
    We therefore compiled the distinguishing features of the various beat patterns; these features are covered later in the implementation.
  • The Kinect can detect 20 joints of one user simultaneously.
    It returns these joints in a Cartesian coordinate system whose origin is the spine, so our input data is a stream of joint coordinates.
    We use the Kinect's lowest depth-image resolution, 320x240, because at this resolution the Kinect returns the most frames per second — 30 frames per second — so data travelling from the Kinect to our system has a 33 millisecond delay.
    Human eyes process 10 to 12 frames per second, a delay of about 100 milliseconds, so the Kinect's transmission delay does not affect the system much; the bigger factor is algorithm design, so we minimize computation to keep the system real-time.
    Conducting systems generally track the conductor's upper body; our algorithm uses only 6 upper-body joints.
    These 6 joints are the left and right hands, the left and right shoulders, the head, and the spine.
    (Human reaction and response time delay is 100 milliseconds.)
  • This is the framework of my whole conducting system. Since no piece is playing at startup, system initialization sets the PlayStatus flag to false.
    When a user stands in front of the Kinect, it tracks the user's skeleton and returns the coordinates to the system.
    The system then checks whether a song is already playing; if not, it enters start-gesture recognition and waits for the user's start gesture.
    Once the start gesture is recognized, the system plays the music, sets PlayStatus to true, and then adjusts the volume, controls the instrument sections, and adjusts the tempo according to the user's conducting trajectory.
    While the song plays, the system watches for a stop gesture; if one is recognized, it sets PlayStatus to false and waits for the user's next start gesture.
    Now let's look at the data returned by the Kinect.
  • Our experiments have three main parts: first, beat-pattern and measure recognition.
    Next, volume-adjustment and instrument-emphasis recognition.
    Finally, the algorithm's response time.
    Due to time constraints, only 1 professional conductor and 8 students with no conducting experience participated; we will recruit more professional conductors later.
    Before testing, users practiced with our system for half an hour to become familiar with it.
    First, beat-pattern and measure recognition.
  • The left shows the professional conductor's recognition results; the right shows those of the participants with no conducting experience.
    The professional's recall is relatively lower because we use only one Kinect: when the conductor leans left or right to emphasize a section, the beat-pattern trajectory cannot be detected.
  • Recall = successful changes / changes the user intended.
    Precision = successful changes / changes the system detected.

    For volume adjustment and instrument emphasis, only the percussion and brass sections have relatively low precision and recall, because both sections sit at the back of the virtual orchestra: when the conductor emphasizes a back section and then pulls the hand back, it is easy to accidentally change the section in front of it.
    Since the left and right sections have buffer zones, their error rates are lower than those of the middle sections.
  • Similarly, to measure the effectiveness of our method, the results on the Hometown dataset were split into "Factual" and "Empty | Fictional".

    - The first category refers to profiles in which the user has explicitly stated his location as a valid point. The second category contains profiles whose location is listed as empty, fictional, or overbroad.

    - WMAE: workers' MAE
    - Tw MAE: tweet MAE

    - Workers would usually agree on the city, but not on the area, as a result of their perception.
    In general, the error distance remained low, also for reallocated tweets.
    The tweet MAE remains low compared to the area of the United States, 3.1 million square miles.
  • diagram
  • 1. No-decay example
    2. Decay example
  • Y-axis maximum is 1
  • Stronger motivation: We can actively
