Examples of working with streaming data

Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com

Hello
陳宜欣 Yi-Shin Chen
• Currently
  – Associate professor at NTHU CS
  – Director of IDEA Lab
• Education
  – Ph.D. in Computer Science, USC, USA
  – M.B.A. in Information Management, NCU, TW
  – B.B.A. in Information Management, NCU, TW
• Courses
  – Introduction to Database Systems
  – Advanced Database Systems
  – Data Mining: Concepts, Techniques, and Applications

Research Focus from 2000
[Figure: DB research areas: storage, index, optimization, query, mining]

Streaming Data
What should we know?

Streaming Data
Continuous flow
• E.g., stock volume, sensor data, social streams
Infinite length
• Impractical to store and use all historical data
Concept drift
• Not wise to use all historical data

Continuous Queries
[Figure: framework with four parts: acquisition process (raw data and transformation of raw stream, kept in a stream DB), crowd wisdom, rules/patterns, and continuous query process (on a transformation of the raw stream), which continuously provides feedback]
Three major approaches for continuous queries
•Fast on-line classification/clustering
•Sliding window
•Range aggregation
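
A minimal sketch of the sliding-window approach, assuming a continuous query that reports the mean of the most recent readings. The class name `SlidingWindowAverage`, the window size and the sample values are illustrative assumptions, not part of the original slides.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous query: running mean over the most recent `size` readings."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest reading is evicted automatically
        self.total = 0.0

    def update(self, value):
        # Subtract the reading that is about to fall out of the window.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


# Hypothetical stream of stock volumes; each arrival refreshes the query result.
query = SlidingWindowAverage(size=3)
results = [query.update(v) for v in [10, 20, 30, 40, 50]]
print(results)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Only the window contents are kept in memory, which fits the point that storing all historical data is impractical.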

Example 1
Auto-identify the Influence of Events Based on Stock News

Framework of Off-line Training Module
[Figure: framework stages covered: acquisition process, crowd wisdom, rules/patterns]

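Alignment
Each industry (Finance, Textile, ..., Car) has a list of companies and related words. A news article n is matched against every list:
$\mathit{belong}_n = [P_{finance}, P_{textile}, \ldots, P_{car}]$
Example news (translated from Chinese): "The Luxgen Neora concept car, which first appeared at the Shanghai Auto Show in April 2011, is the first concept car that the domestic independent brand Luxgen has launched since its founding ..."
$P_{finance} = 0,\quad P_{textile} = 0,\quad P_{car} = 3$
$\mathit{belong}_n = [0, 0, \ldots, 3]$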
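
The alignment step could be sketched as follows, assuming each score P counts how many of an industry's terms occur in the article. The term lists in `INDUSTRY_TERMS` and the function name `belong` are hypothetical; the slides do not give the real lists or the exact counting rule.

```python
# Hypothetical term lists; the real lists of companies and related words
# per industry are not given in the slides.
INDUSTRY_TERMS = {
    "finance": ["bank", "interest rate", "insurance"],
    "textile": ["fabric", "yarn", "garment"],
    "car": ["Luxgen", "concept car", "auto show"],
}


def belong(news_text, industry_terms=INDUSTRY_TERMS):
    """Return one score per industry: how many of its terms occur in the text."""
    text = news_text.lower()
    return {
        industry: sum(1 for term in terms if term.lower() in text)
        for industry, terms in industry_terms.items()
    }


news = ("The Luxgen Neora concept car debuted at the Shanghai "
        "Auto Show in April 2011.")
scores = belong(news)
print(scores)  # {'finance': 0, 'textile': 0, 'car': 3}
```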

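Itemset Production
Itemsets produced per group (translated from Chinese):
Japan+earthquake    Japan+disaster relief
Japan+earthquake    Japan+flooding
Japan+earthquake    Japan+impact
Japan+earthquake    Japan+estimate
Japan+earthquake    Japan+damage
Japan+purchase      Japan+travel
...                 ...
The confidence of Japan+earthquake:
• Number of transactions in which Japan+earthquake appears: $u_s$
• Number of transactions in which Japan appears: $n_p$
$\mathit{confidence} = \frac{u_s}{n_p} = \frac{5}{6}$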
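
The confidence ratio can be checked with a short sketch. The six transactions below are hypothetical and only chosen to be consistent with the counts on the slide (Japan appears in 6 transactions, Japan+earthquake in 5).

```python
def confidence(transactions, antecedent, consequent):
    """confidence(antecedent -> consequent) = u_s / n_p."""
    # n_p: transactions containing the antecedent word
    n_p = sum(1 for t in transactions if antecedent in t)
    # u_s: transactions containing both words
    u_s = sum(1 for t in transactions if antecedent in t and consequent in t)
    return u_s / n_p if n_p else 0.0


# Hypothetical transactions (sets of words extracted from news articles).
transactions = [
    {"Japan", "earthquake", "disaster relief"},
    {"Japan", "earthquake", "flooding"},
    {"Japan", "earthquake", "impact"},
    {"Japan", "earthquake", "estimate"},
    {"Japan", "earthquake", "damage"},
    {"Japan", "purchase", "travel"},
]
conf = confidence(transactions, "Japan", "earthquake")
print(round(conf, 3))  # 0.833
```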

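Representative Itemset Selection
Select itemsets with high confidence as candidates for the representative itemset.
$\mathit{weight} = x \cdot \mathit{tfidf}_1 + y \cdot \mathit{tfidf}_2 + z \cdot \mathit{confidence}$
Example (translated from Chinese):
Word         Japan  earthquake  estimate  nuclear  leak  crisis  occur
tf-idf       0.22   0.25        0.03      0.18     0.2   0.10    0.001
Itemset      Japan+earthquake  Japan+estimate  nuclear+leak  crisis+occur
Confidence   0.833             1               0.667         0.667
Concept: Japan, earthquake, nuclear, leak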
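
A sketch of the weighting formula under several loudly stated assumptions: the seven numbers on the slide are read as per-word tf-idf values and the four numbers as per-itemset confidences (in slide order), the coefficients x, y, z are hypothetical, and keeping the two highest-weighted itemsets is an illustrative selection rule rather than the author's documented one.

```python
def itemset_weight(tfidf1, tfidf2, confidence, x=0.45, y=0.45, z=0.1):
    """weight = x * tfidf1 + y * tfidf2 + z * confidence (x, y, z are assumed)."""
    return x * tfidf1 + y * tfidf2 + z * confidence


# Values read off the slide; the word-to-value mapping is an assumption.
tfidf = {"Japan": 0.22, "earthquake": 0.25, "estimate": 0.03,
         "nuclear": 0.18, "leak": 0.2, "crisis": 0.10, "occur": 0.001}
candidates = {("Japan", "earthquake"): 0.833, ("Japan", "estimate"): 1.0,
              ("nuclear", "leak"): 0.667, ("crisis", "occur"): 0.667}

# Rank the candidate itemsets by weight, highest first.
ranked = sorted(
    candidates,
    key=lambda pair: itemset_weight(tfidf[pair[0]], tfidf[pair[1]], candidates[pair]),
    reverse=True,
)

# Merge the words of the two best itemsets into one concept, keeping order.
concept = []
for pair in ranked[:2]:
    for word in pair:
        if word not in concept:
            concept.append(word)
print(concept)  # ['Japan', 'earthquake', 'nuclear', 'leak']
```

With equal coefficients the ranking changes, so the result depends on how x, y and z are tuned.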

Concept Verification
By considering:
• The daily frequency of concept $C_j$
• The concept index $CI_j$ of $C_j$
• A regression model based on price within sliding windows
If the p-value rejects $H_0$, the concept $C_j$ is considered an influential event.

On-line Prediction Module
Regression prediction
• Uses the most frequent event.
Adjust regression prediction
• Also includes events other than the most frequent one.
Pheromone prediction
• Also includes past influence.
[Stage: continuous query process]

Experimental Data
• Stock data
  – Industry index from TWSE
  – 2012-01-01 to 2012-05-11
• News data
  – News crawled from websites
  – Yahoo!, udn, Libertytimes, PCHome, etc.
  – 13 websites in total
  – 2012-01-01 to 2012-05-11
  – More than 150,000 news articles
  – All news articles are in Traditional Chinese

Experimental Setup
Four methods to predict the market:
• Pheromone prediction model
• Adjust regression prediction model
• Regression prediction model
• Blind test
Prediction policy: fall, rise, or NSM (no significant move)

Performance
Accuracy of four methods:
Method              Average accuracy
Pheromone           0.5784574
Adjust regression   0.5323214
Regression          0.5134457
Blind test          0.3045479

Performance
Does it work on the whole market?
• We became interested in using events to predict the whole market by aggregating all industries into one.
Type                Accuracy
Pheromone           0.6315789
Adjust regression   0.6896511
Regression          0.5714285

Example 2
An Interactive Conducting System Using Motion Detector

Motivation
Diversify human-computer interaction technology with multimedia
• Music education
• Music experiment
• Amateur and professional conductors
• Composers
• Personal amusement

Devices
• Build an interactive conducting system using a motion detector
[Figure: Microsoft Kinect with 3D depth sensors]

Conducting Data (Data Streams)
• Cartesian coordinates (x, y, z)
• 30 frames per second at 320x240 resolution
  – Delay ≈ 33 ms (1/30 second)
• Human eyes can process 10 to 12 frames per second [2]
  – Delay ≈ 100 ms (1/10 second)
[Figure: coordinate axes relative to the sensor: +X/-X, +Y/-Y, and Z along the sensor direction]

Framework
[Flowchart, reconstructed from its labels]
• Initial system: PlayStatus = False
• Conducting data received; is PlayStatus true?
  – No: start gesture recognition. Is Start true? Yes: PlayStatus = True.
  – Yes: stop gesture recognition. Is Stop true? Yes: PlayStatus = False. No: continue with the steps below.
• Beat pattern recognition (whole measure)
• Volume (relative height of hand) and instrument emphasis identification (tilt, Z-mapping)
• Volume adjustment and tempo adjustment according to instrument emphasis
[Stages: acquisition process, crowd wisdom, rules/patterns, offline analysis, continuous query process]
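
The PlayStatus logic of the flowchart can be sketched as a small state machine. This is a minimal sketch: the class name `ConductingSession` and the boolean flags `start_gesture` and `stop_gesture` are hypothetical stand-ins for the outputs of the gesture recognizers, which the slides do not specify.

```python
class ConductingSession:
    """Tracks PlayStatus across incoming frames of conducting data."""

    def __init__(self):
        self.play_status = False  # initial system: PlayStatus = False

    def on_frame(self, start_gesture=False, stop_gesture=False):
        """Process one frame; the gesture flags are assumed recognizer outputs."""
        if not self.play_status:
            # Not playing: only the start gesture matters.
            if start_gesture:
                self.play_status = True
                return "started"
            return "idle"
        # Playing: check for the stop gesture first.
        if stop_gesture:
            self.play_status = False
            return "stopped"
        # Beat pattern, volume and tempo steps would run here.
        return "conducting"


session = ConductingSession()
frames = [
    {},                        # no gesture yet
    {"start_gesture": True},   # conductor starts
    {},                        # normal conducting frame
    {"stop_gesture": True},    # conductor stops
    {},                        # back to waiting
]
states = [session.on_frame(**frame) for frame in frames]
print(states)  # ['idle', 'started', 'conducting', 'stopped', 'idle']
```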

Experiments
• Evaluation
  – Beat pattern and measure recognition
  – Volume control and instrument emphasis recognition
  – Response time
• Experimental setup
  – Participants: 1 professional, 8 with no experience
  – Practice: 30 minutes

Beat Pattern and Measure Recognition Evaluation
[Figure: bar chart of recognition rate (recall and precision) for the professional and the participants with no experience; data labels 0.7826, 0.8648, 0.8438, 0.8821]

Instrument Emphasis
• Adjust volume in the correct instrument sections
[Figure: bar chart of recognition rate (recall and precision); data labels 1, 0.9375, 1, 0.8666, 1, 1, 1, 1, 1, 0.9286, 1, 1]

Example 3
Social Stream Analysis for Location Identification

Goal
Identify the location of a particular Twitter user at a given time
• Using exclusively the content of his/her tweets

Major Challenges
Twitter challenges
• Tweets are noisy
• Extensive use of non-standard vocabulary
• Bots and spammers
Geo-locational challenges
• Users might have several associated locations
• Toponyms
• Scarce information
• False profile information

Framework
[Figure: framework stages: acquisition process, crowd wisdom, rules/patterns, continuous query process]

Experimental Setup
• Original dataset: 1.53 M Twitter users and 13 M tweets
• 3,314 Twitter users and 2.2 M tweets
• 104,054 geo-tagged tweets
• Although we collected and processed the data carefully, it still needed to be validated
  – Use of local experts: people familiar with the geography of the country
Counts after each processing step:
Step                       Count
Original tweets            329,814
Subject identification     57,153
Location discovery tweets  18,662
Toponyms removal           9,093
Timeline sorting           6,928
Final results              2,165

Evaluation
Recruited an international workforce from
• A crowdsourcing platform with a good reputation

Example 4
Social Stream Analysis for Event Identification

Introduction
Analyzing social streams can benefit
• Emergency control
• Crowd opinion analysis
• Detection of unreported events
Motivation: event identification from social streams

Methodology
[Figure: pipeline. Tweets data → preprocess → keyword selection → event candidate recognition → event candidates; event candidates and user social structures → evolving social graph analysis → event identification. Stages: acquisition process, offline analysis, crowd wisdom, rules/patterns, continuous query process]

Methodology – Keyword Selection
Well-noticed criterion
• A word is well-noticed if, compared to the past, it is suddenly mentioned by many users
• Time frame: one unit of time
• Sliding window: a certain number of past time frames
[Figure: timeline divided into time frames tf0, tf1, tf2, tf3, tf4]
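
A minimal sketch of the well-noticed criterion, assuming a word qualifies when its user count in the current time frame exceeds a multiple of its average over the sliding window. The thresholds `ratio` and `min_users`, the function name and the sample counts are hypothetical; the slides do not give the actual test.

```python
from collections import deque


def well_noticed(history, current, ratio=3.0, min_users=5):
    """Return words whose user count jumps relative to the past time frames.

    history: iterable of dicts (word -> number of distinct users), one per
             past time frame in the sliding window.
    current: dict for the current time frame.
    """
    noticed = []
    for word, users in current.items():
        past = [frame.get(word, 0) for frame in history]
        baseline = sum(past) / len(past) if past else 0.0
        if users >= min_users and users > ratio * baseline:
            noticed.append(word)
    return sorted(noticed)


# Sliding window of the last four time frames (tf0..tf3); counts are made up.
window = deque(maxlen=4)
for frame in [{"boston": 2, "coffee": 6}, {"boston": 1, "coffee": 5},
              {"boston": 2, "coffee": 7}, {"boston": 3, "coffee": 6}]:
    window.append(frame)

current = {"boston": 40, "coffee": 6, "explosion": 25}  # tf4
keywords = well_noticed(window, current)
print(keywords)  # ['boston', 'explosion']
```

A steadily popular word such as "coffee" is not selected, because its current count does not exceed its own past average by the required ratio.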

Methodology – Event Candidate Recognition
Idea: group one keyword with its most relevant keywords into one event candidate
[Figure: keyword graph with the nodes boston, explosion, confirm, prayer, bombing, boston-marathon, threat, iraq, jfk, hospital, victim, afghanistan, bomb, america]
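
One way to sketch the grouping idea, assuming that relevance is measured by how often two words co-occur in the same tweet. That measure, the function name `event_candidate`, the `top_k` cut-off and the sample tweets are all assumptions for illustration.

```python
from collections import Counter


def event_candidate(tweets, keyword, top_k=2):
    """Group `keyword` with the top_k words that co-occur with it most often."""
    cooccur = Counter()
    for words in tweets:
        unique = set(words)
        if keyword in unique:
            cooccur.update(unique - {keyword})
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(cooccur.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword] + [word for word, _ in ranked[:top_k]]


# Hypothetical tokenized tweets.
tweets = [
    ["boston", "explosion", "marathon"],
    ["boston", "bombing", "victim"],
    ["boston", "explosion", "hospital"],
    ["boston", "bombing", "explosion"],
    ["iraq", "bomb", "threat"],
]
candidate = event_candidate(tweets, "boston")
print(candidate)  # ['boston', 'explosion', 'bombing']
```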

Methodology – Evolving Social Graph Analysis
• Information decay:
  – Vertex weight, edge weight
  – Decay mechanism
• Concept-Based Evolving Graph Sequences (cEGS): a sequence of directed graphs that demonstrate information propagation
[Figure: graphs at time frames tf1, tf2, tf3]
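
The decay mechanism might look like the sketch below, which assumes exponential decay of edge weights per time frame and drops edges that fall under a threshold. The decay factor, the threshold, the function name `next_frame` and the user names are hypothetical; the slides do not state the actual decay function.

```python
def next_frame(edges, new_interactions, decay=0.5, threshold=0.1):
    """Build the graph for the next time frame.

    edges: dict mapping a directed edge (source, target) to its weight.
    new_interactions: directed edges observed in the new time frame.
    """
    # Old information fades: decay every weight and drop faded edges.
    updated = {edge: weight * decay
               for edge, weight in edges.items()
               if weight * decay >= threshold}
    # Fresh propagation adds full weight to its edge.
    for edge in new_interactions:
        updated[edge] = updated.get(edge, 0.0) + 1.0
    return updated


# Hypothetical propagation over three time frames (tf1, tf2, tf3).
graph = {}
sequence = []
for interactions in [[("alice", "bob")], [("bob", "carol")], [("bob", "carol")]]:
    graph = next_frame(graph, interactions)
    sequence.append(dict(graph))

print(sequence[-1])  # {('alice', 'bob'): 0.25, ('bob', 'carol'): 1.5}
```

The list `sequence` plays the role of a graph sequence: one directed, weighted graph per time frame.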

Experiment
Testing
• Events identified in November 2013
• Evaluated by 7 human experts
Average precision: 86.64%
[Figure: precision (0 to 1) by date, Nov_2 to Nov_30]

Example 5
Social Stream Analysis for Mental Disorder Detection

Introduction
18.1% of people in the United States suffer from a mental disorder (*)
Using social networks to research mental disorders
(*) National Institute of Mental Health:
http://www.nimh.nih.gov/health/statistics/prevalence/index.shtml

Background
Bipolar Disorder:
• Unstable and impulsive emotions
• Cycling between manic and depressive episodes
Borderline Personality Disorder:
• Unstable and impulsive emotions
• Impaired social interactions

Framework
[Figure: framework stages: acquisition process, crowd wisdom, rules/patterns]

Collect Patient Data
[Figure: support group]

[Figure: followers]

Wait! Control Group Needed

Collect Data from Ordinary People

Basic Guidelines
• Identify the commonalities and differences between the experimental and control groups
• Features
  – Word/pattern frequency
  – Emotion-related data (e.g., flipping rates, occurrence rates)
  – Social interaction (e.g., retweet, reply)
  – Lifestyle (e.g., online time, staying up late or not)
  – Age and gender
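
A sketch of how a few of these features could be computed per user. The definitions are assumptions: the flipping rate is taken here as the share of consecutive emotional posts whose polarity changes, and the post fields `polarity`, `is_retweet` and `hour` are hypothetical.

```python
def flipping_rate(polarities):
    """Share of consecutive post pairs whose emotion polarity changes."""
    pairs = list(zip(polarities, polarities[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def extract_features(posts):
    """Turn one user's posts (in time order) into a small feature dict."""
    n = len(posts)
    if n == 0:
        return {"flipping_rate": 0.0, "emotion_rate": 0.0,
                "retweet_rate": 0.0, "late_night_rate": 0.0}
    # Keep only posts that carry an emotion (+1 positive, -1 negative).
    polarities = [p["polarity"] for p in posts if p["polarity"] != 0]
    return {
        "flipping_rate": flipping_rate(polarities),
        "emotion_rate": len(polarities) / n,
        "retweet_rate": sum(1 for p in posts if p["is_retweet"]) / n,
        # Assumed lifestyle proxy: posts made between midnight and 5 a.m.
        "late_night_rate": sum(1 for p in posts if p["hour"] < 5) / n,
    }


# Hypothetical posts of one user.
posts = [
    {"polarity": 1, "is_retweet": False, "hour": 14},
    {"polarity": -1, "is_retweet": True, "hour": 2},
    {"polarity": 0, "is_retweet": False, "hour": 3},
    {"polarity": 1, "is_retweet": False, "hour": 23},
]
features = extract_features(posts)
print(features)
```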

Apply Classifiers (Online)
• Using the extracted features
• Various classifiers
  – Neural networks
  – Naïve Bayes and Bayesian belief networks
  – Support vector machines
  – Random forest
[Stage: continuous query process]
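
To illustrate what "online" can mean here, the sketch below shows a multinomial Naïve Bayes classifier that is updated one example at a time. It is a toy under stated assumptions: the class `OnlineNaiveBayes`, the word features and the labels are hypothetical, and it is not the classifier configuration used in the study.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing, trained incrementally."""

    def __init__(self):
        self.class_counts = Counter()            # examples seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def learn(self, words, label):
        """Update the model with a single labeled example from the stream."""
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        """Return the label with the highest log-posterior, or None if untrained."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stream of (features, label) pairs; the words are made up.
model = OnlineNaiveBayes()
stream = [
    (["sad", "alone", "tired"], "experimental"),
    (["happy", "lunch", "friends"], "control"),
    (["alone", "angry", "sad"], "experimental"),
    (["game", "friends", "happy"], "control"),
]
for words, label in stream:
    model.learn(words, label)

print(model.predict(["sad", "alone"]))      # experimental
print(model.predict(["happy", "friends"]))  # control
```

Because `learn` only updates counters, the model can keep absorbing new labeled examples without retraining on historical data.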

Possible Continuous Query Results

More in the future…
Thank you.
Contact me at:
yishin@gmail.com
