
Integrating Sensor and Social Data for Understanding City Events

Pramod Anantharam

Published in: Data & Analytics


  1. Semantic Approach to Big Data and Event Processing: Integrating Sensor and Social Data for Understanding City Events
     Pramod Anantharam, Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, USA
     Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
  2. [Figure labels: Slow-moving traffic; Link description; Scheduled event (511.org); Schedule information (511.org)]
  3. [No text on slide]
  4. Multimodal Data Integration
     • Why?
       – Provides complementary information for comprehensive situational awareness
         • Sensor : Social :: Quantitative : Qualitative
       – Corroboration can further improve trustworthiness
     • What?
       – Collect and relate multimodal sensor data and social media data
     • How?
       – Correlate heterogeneous data streams by exploiting spatio-temporal proximity and domain knowledge
     T. K. Prasad
  5. Traffic Domain Use Case (open data)
     • Why?
       – Explain/interpret average speed and link travel time data using the event schedule provided by city authorities and real-time traffic events shared on Twitter
       – Past work: predict congestion based on historical sensor data
     • What?
       – Combine
         • 511.org data about Bay Area road network traffic
           – E.g., average speed and link travel time data stream
           – E.g., (happened or planned) event reports
         • Tweets that report events, including ad hoc ones
  6. Traffic Domain Use Case (open data)
     • How?
       – Extract events from the textual tweet stream
       – Build statistical models of normalcy, and thereby anomaly, from numerical sensor data streams
       – Correlate multimodal streams, using spatio-temporal information, to annotate “anomalies” in sensor data time series with textual events
  8. Various City Events Reported on Twitter
  9. Some Challenges in Extracting Events from Tweets
     • No well-accepted definition of ‘events related to a city’
     • Tweets are short (140 characters), and their informal nature makes them hard to analyze
       – Entity, location, time, and type of the event
     • Multiple reports of the same event and sparse reports of some events (biased sample)
       – Numbers don’t necessarily indicate intensity
     • Validation of the solution is hard due to the open-domain nature of the problem
  10. Related Work on Event Extraction
     [Figure: related work arranged along two axes, Formal Text vs. Informal Text and Closed Domain vs. Open Domain]
     Cited work: [Roitman et al. 2012] [Kumaran and Allan 2004] [Lampos and Cristianini 2012] [Becker et al. 2011] [Wang et al. 2012] [Ritter et al. 2012]
  11. City Event Extraction from Textual Data
     [ABTA-14] Pramod Anantharam, Payam Barnaghi, Krishnaprasad Thirunarayan, and Amit Sheth. 2015. Extracting City Traffic Events from Social Streams. ACM Trans. Intell. Syst. Technol. 6, 4, Article 43 (July 2015), 27 pages. DOI=10.1145/2717317 http://doi.acm.org/10.1145/2717317
  12. Evaluation
     • City Event Annotation
       – Automated creation of training data
       – Annotation task (our CRF model vs. baseline CRF model)
     • City Event Extraction
       – Use aggregation algorithm for event extraction
       – Extracted events AND ground truth
     • Dataset (Aug – Nov 2013), ~8 GB of data on disk
       – Over 8 million tweets
       – Over 162 million sensor data points
       – 311 active events and 170 scheduled events
  13. Evaluation Metric for Comparing Events with Ground Truth
     • Complementary Events
       • Additional information, e.g., slow traffic from sensor data and an accident from textual data
     • Corroborative Events
       • Additional confidence, e.g., an accident event supporting an accident report from ground truth
     • Timeliness
       • Early detection, e.g., knowing about poor visibility before its formal report
     [Figure: Distribution of Extracted Events Over Locations]
  14. Complementary Events [three figure panels]
  15. Corroborative Events [three figure panels]
  16. Evaluating Timeliness [figure panels: Timeliness]
  17. Traffic Domain Use Case (open data)
     • How?
       – Extract events from the textual tweet stream
       – Build statistical models of normalcy, and thereby anomaly, from numerical sensor data streams
       – Correlate multimodal streams, using spatio-temporal information, to annotate “anomalies” in sensor data time series with textual events
  18. Correlating Multimodal Streams: Preliminary Insights
     • Multiple events
     • Varying influence
     • Events interact with each other
     Focus of this talk: algorithms to understand these manifestations
     Image credit: http://traffic.511.org/index
  19. Traffic Dependencies
     • Causes of non-linearity in sensor data streams
       – Temporal landmarks: peak-hour vs. off-peak traffic vs. weekend traffic
       – Effect of location
       – Scheduled events such as road construction, a baseball game, or a music concert
       – Unexpected events such as accidents or heavy rains
       – Random variations (viz., stochasticity)
  20. Abstracting Traffic Behavior: Traffic Data Model
     • Disclaimer: “All models are wrong, but some are useful.” (George Box)
     • Normalcy Model
       – Gaussian Mixture Model (GMM)
         • Captures multiple co-existing events and their impact on traffic
       – Auto Regressive (AR) Models
         • Capture temporal dependencies in traffic dynamics
       – Restricted Switching Linear Dynamical System
         • Exploits domain common sense for stationarity
         • One LDS model per road link per week hour (24 hr x 7 days/week => 168 models)
     • Anomaly Model
       – Cf. box-and-whisker plots
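The "one LDS model per road link per week hour" scheme on the Traffic Data Model slide can be sketched as a simple lookup key. This is an illustrative sketch only: the function names `week_hour` and `model_key`, the link identifier format, and the choice of Monday 00:00 as slot 0 are assumptions, not details from the slides.

```python
from datetime import datetime

HOURS_PER_WEEK = 24 * 7  # 168 models per road link, as stated on the slide


def week_hour(ts: datetime) -> int:
    """Map a timestamp to an hour-of-week slot in [0, 167].

    Assumption: slot 0 is Monday 00:00-00:59 (Python's weekday() convention).
    """
    return ts.weekday() * 24 + ts.hour


def model_key(link_id: str, ts: datetime) -> tuple:
    """Hypothetical key used to select the normalcy model for an observation."""
    return (link_id, week_hour(ts))


# Example: June 1, 2014 was a Sunday, so 23:00 falls in the last slot of the week.
sunday_late = week_hour(datetime(2014, 6, 1, 23, 15))
monday_early = week_hour(datetime(2014, 6, 2, 0, 30))
```

Keying models this way treats traffic in the same hour of the same weekday as comparable, which is one reading of the "domain common sense for stationarity" bullet.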
  21. AT&T Park
     Image credit: http://tourontap.com/us-open-2012/courses-and-more-by-the-bay/
  22. Traffic Data: First Peek
     • Histogram of speed values collected from June 1st 12:00 AM to June 2nd 12:00 AM
     • Histogram of travel time values collected from June 1st 12:00 AM to June 2nd 12:00 AM
  23. Traffic Data: Possible Explanation
     • Most drivers tend to go 5 km/h over the posted speed limit
     • Relatively fewer drivers go more than 10 km/h over the posted speed limit
     • There are situations in a day where drivers are forced to go below the speed limit, e.g., rush-hour traffic
     Do these histograms resemble any probability distribution?
  24. Gaussian Distribution
     “many variables such as height, weight, IQ scores, reading ability, job satisfaction, blood pressure turn out to have distributions that are bell-shaped or normal.” [2]
     Popularized by Gauss in 1809, who used it to analyze astronomical data; hence it is now popularly known as the Gaussian distribution. http://en.wikipedia.org/wiki/Normal_distribution
     P(x) = G(μ, σ²)
     [2] http://peoplelearn.homestead.com/Topic3NORMAL1.html
  25. Multiple Gaussian Distributions: A Better Fit for Speed Observations?
     This distribution resembles a Gaussian Mixture Model (GMM)
  26. Golden Gate Fields: Comparing Months with Varying Event Occurrences
     • Assume normalcy to be uninterrupted traffic flow
     • July 2014 has no events, so we hypothesize a higher log-likelihood score
     • June 2014 has many events, so we hypothesize a lower log-likelihood score
     Log-likelihood scores shown on the slide: -115655.8 and -125974.3
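The idea of scoring a month of observations against a GMM normalcy model can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the data are synthetic (not 511.org data), the mixture has two components, and the EM routine `fit_gmm_1d` is a minimal hand-written version rather than the implementation used in the talk.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mixture_pdf(x, model):
    """Density of the mixture (weights, means, variances) at x."""
    weights, means, variances = model
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def fit_gmm_1d(data, k=2, iters=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    srt = sorted(data)
    n = len(srt)
    # Initialise means at evenly spaced quantiles and share the overall variance.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    overall = sum(data) / n
    var0 = max(sum((x - overall) ** 2 for x in data) / n, min_var)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)
    return weights, means, variances


def log_likelihood(data, model):
    """Sum of log densities of the observations under the mixture."""
    return sum(math.log(max(mixture_pdf(x, model), 1e-300)) for x in data)


# Synthetic speeds (km/h): free-flow and rush-hour regimes stand in for normal traffic.
rng = random.Random(0)


def sample_normal_traffic(n):
    return [rng.gauss(100, 5) if rng.random() < 0.7 else rng.gauss(60, 8) for _ in range(n)]


model = fit_gmm_1d(sample_normal_traffic(500), k=2)
quiet = sample_normal_traffic(200)
# An "eventful" period: some observations are far slower than anything seen in training.
eventful = sample_normal_traffic(150) + [rng.gauss(20, 5) for _ in range(50)]
quiet_ll = log_likelihood(quiet, model)
eventful_ll = log_likelihood(eventful, model)
```

With equal sample sizes, the eventful sample scores lower than the quiet one under the fitted model, which mirrors the hypothesis on the slide.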
  27. Hourly Traffic Dynamics Over a Day
  28. Modeling Traffic Dynamics: Statistical Models and Intuitions
     • Differentiate various traffic dynamics
       – A Gaussian mixture model is too coarse-grained, as it does not discriminate between traffic increasing over an hour and traffic decreasing over the same hour.
     • Account for unobserved factors
       – Autoregressive models cannot capture unobserved factors
         • E.g., traffic volume, which may be unobserved, dictates how events manifest in link speed and travel time variations.
       – A Linear Dynamical System introduces a latent state-based model
         • E.g., traffic volume (low vs. high), road lane closures, and weather conditions (visibility) can impact how observations evolve.
         • Emission/transition matrices and Gaussian noise capture stochasticity.
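To make the latent-state intuition concrete, the following sketch scores a trace under a scalar linear dynamical system using the standard Kalman filter recursion. It is a one-dimensional illustration under assumed parameters; the names `a`, `c`, `q`, `r`, `m0`, `p0` and the example traces are hypothetical and do not come from the talk.

```python
import math


def lds_log_likelihood(ys, a, c, q, r, m0, p0):
    """Log-likelihood of observations ys under a scalar LDS.

    Model (assumed for this sketch):
        x_t = a * x_{t-1} + N(0, q)   latent state transition
        y_t = c * x_t     + N(0, r)   emission
        x_1 ~ N(m0, p0)               prior on the first latent state
    """
    mean, var = m0, p0  # predicted state distribution before seeing y_t
    total = 0.0
    for y in ys:
        # Predictive distribution of the observation.
        innovation = y - c * mean
        s = c * c * var + r
        total += -0.5 * (math.log(2 * math.pi * s) + innovation * innovation / s)
        # Kalman update with the new observation.
        gain = var * c / s
        mean_f = mean + gain * innovation
        var_f = (1 - gain * c) * var
        # Predict the next latent state.
        mean = a * mean_f
        var = a * a * var_f + q
    return total


# Hypothetical speed traces (km/h) for one hour on one link.
params = dict(a=1.0, c=1.0, q=1.0, r=1.0, m0=60.0, p0=1.0)
smooth = [60.0, 60.5, 61.0, 60.8, 61.2]
with_drop = [60.0, 60.5, 35.0, 34.0, 61.2]
smooth_ll = lds_log_likelihood(smooth, **params)
drop_ll = lds_log_likelihood(with_drop, **params)
```

A trace with a sudden drop in speed receives a lower log-likelihood than a smooth one, because the large innovations are unlikely under the assumed noise levels.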
  29. Linear Dynamical System Model
     • Characterize the data time series (by learning the distribution of behavior at each time point using mean and variance)
     • Pick a realizable medoid time series as the prototype for comparison, summarized using LDS parameters
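Picking a medoid means choosing an actually observed series (hence "realizable") that is closest, in total, to all the others. A minimal sketch follows, assuming equal-length series and Euclidean distance; the distance measure is an assumption, since the slide does not name one.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid_index(series_list):
    """Index of the series with the smallest summed distance to all other series."""
    best_index, best_cost = 0, float("inf")
    for i, candidate in enumerate(series_list):
        cost = sum(euclidean(candidate, other) for other in series_list)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical traces: four similar series and one outlier.
traces = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
    [4.0, 4.0, 4.0],
    [20.0, 20.0, 20.0],
]
prototype = traces[medoid_index(traces)]
```

Unlike a pointwise mean, the medoid is robust to the outlier trace and is guaranteed to be one of the observed series.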
  30. Learning LDS Models
  31. Tagging Anomalies with LDS Models
  32. Tagging Anomalies: Intuitions
     • Normalcy: log-likelihood scores of traces from event-free data, visualized as a box-and-whisker plot
       – Intertwined with long-term construction event influence
     • Anomaly: log-likelihood score falls beyond the whisker threshold for eventful data
     [Figure axis: Log-likelihood score]
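The whisker-threshold rule can be sketched as follows. The 1.5 x IQR whisker length is the common Tukey convention and is an assumption here, as is the choice to flag only scores below the lower whisker (unusually low likelihood); the slide does not specify either detail.

```python
import statistics


def whisker_bounds(scores, k=1.5):
    """Lower and upper whisker limits from event-free log-likelihood scores.

    Assumption: whiskers extend k * IQR beyond the first and third quartiles.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomaly(score, bounds):
    """Flag a trace whose log-likelihood falls below the lower whisker."""
    lower, _upper = bounds
    return score < lower


# Hypothetical log-likelihood scores of traces from event-free data.
normal_scores = [-100, -102, -98, -101, -99, -103, -97]
bounds = whisker_bounds(normal_scores)
flag_far = is_anomaly(-120.0, bounds)
flag_near = is_anomaly(-105.0, bounds)
```

Scores inside the whiskers are treated as ordinary variation, so only traces that the normalcy model finds very unlikely are tagged.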
  33. Traffic Domain Use Case (open data)
     • How?
       – Extract events from the textual tweet stream
       – Build statistical models of normalcy, and thereby anomaly, from numerical sensor data streams
       – Correlate multimodal streams, using spatio-temporal information, to annotate “anomalies” in sensor data time series with textual events
  34. Spatio-temporal co-occurrence criteria
     • If an anomaly is detected on a link L during the time period [t_st, t_et], then the anomaly is explained by an event if the event occurred within a 0.5 km radius and during [t_st - 1, t_et + 1].
     • CAVEAT: An anomaly may remain unexplained because of missing data.
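The co-occurrence criteria can be sketched as a predicate over an anomaly and a candidate event. Several details are assumptions made for illustration: the slack of 1 is taken to mean one hour, a link is represented by a single point, an event has a single timestamp, and distance is computed with the haversine formula. The `Anomaly` and `Event` classes and the coordinates are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Anomaly:
    lat: float          # representative point of the link (assumption)
    lon: float
    start: datetime     # t_st
    end: datetime       # t_et


@dataclass
class Event:
    lat: float
    lon: float
    time: datetime      # single timestamp (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def explains(anomaly, event, radius_km=0.5, slack=timedelta(hours=1)):
    """True if the event is within radius_km and within [t_st - slack, t_et + slack]."""
    near = haversine_km(anomaly.lat, anomaly.lon, event.lat, event.lon) <= radius_km
    in_window = anomaly.start - slack <= event.time <= anomaly.end + slack
    return near and in_window


anomaly = Anomaly(37.7786, -122.3893, datetime(2014, 6, 1, 18, 0), datetime(2014, 6, 1, 19, 0))
nearby_event = Event(37.7800, -122.3900, datetime(2014, 6, 1, 17, 30))
distant_event = Event(37.8000, -122.3900, datetime(2014, 6, 1, 17, 30))
early_event = Event(37.7800, -122.3900, datetime(2014, 6, 1, 15, 0))
```

An anomaly for which no event passes this predicate stays unexplained, which is where the missing-data caveat applies.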
  35. Experimental Data Statistics and Infrastructure
     • Data collected from the San Francisco Bay Area between May 2014 and May 2015
       – 511.org:
         • 1,638 traffic incident reports
         • 1.4 billion speed and travel time observations
       – Twitter data: 39,208 traffic-related incidents extracted from over 20 million tweets [1]
     • A naïve implementation for learning normalcy models for 2,534 links took 40 minutes per link (~2 months of processing time for our data)
       – 2.66 GHz Intel Core 2 Duo with 8 GB main memory
     • A scalable implementation that exploits the nature of the problem learned the normalcy models within 24 hours
       – The Apache Spark cluster used in our evaluation has 864 cores and 17 TB main memory.
     [1] Anantharam, P. 2014. Extracting city traffic events from social streams. https://osf.io/b4q2t/wiki/home/
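The "~2 months" figure for the naïve implementation follows from the link count and per-link time quoted on the slide. A quick check, assuming the links are processed one after another on a single machine:

```python
links = 2534            # road links, from the slide
minutes_per_link = 40   # naive learning time per link, from the slide

total_minutes = links * minutes_per_link
total_days = total_minutes / (60 * 24)

# The scalable run finished within 24 hours, so the speed-up is at least this factor.
min_speedup = total_days / 1.0
```

This gives roughly 70 days of serial processing, consistent with the slide's estimate, and implies a speed-up of about 70x or more for the Spark-based run.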
  36. Evaluation Results
  37. Semantic Approach to Big Data and Event Processing
     Thank you! Any questions?
