The document discusses using mobility traces and context information to detect loss or theft of mobile devices. It proposes converting traces and context into "behavior text" representations, then building an n-gram language model to establish a baseline for normal behavior. The model can detect anomalies indicating potential loss or theft events by flagging sequences with unexpectedly low probabilities. The approach aims to discover such events early for notification and recovery efforts.
The penetration of mobile devices equipped with various embedded sensors also makes it possible to capture the physical and virtual context of the user and the surrounding environment. Further, modeling human behavior from these data becomes increasingly important with the growing popularity of context-aware computing and people-centric applications, which use users' behavior patterns to improve existing services or enable new ones. In many natural settings, however, their broader application is hindered by three main challenges: the rarity of labels, the uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion.
We introduce a new mobile system framework, SenSec, which uses passive sensory data to ensure the security of applications and data on mobile devices.
SenSec constantly collects sensory data from accelerometers, gyroscopes and magnetometers and constructs the gesture model of how a user uses the device.
SenSec calculates the sureness that the mobile device is being used by its owner.
Based on the sureness score, mobile devices can dynamically request the user to provide active authentication (such as a strong password), or disable certain features of the device to protect the user's privacy and information security.
In this paper, we model such gesture patterns through a continuous n-gram language model using a set of features constructed from these sensors. We built a mobile application prototype based on this model and used it to perform both user classification and user authentication experiments. User studies show that SenSec can achieve 75% accuracy in identifying users and 71.3% accuracy in detecting non-owners, with only 13.1% false alarms.
Guest Lecture: SenSec - Mobile Security through BehavioMetrics Jiang Zhu
This document summarizes research on using mobile sensor data and behavioral biometrics for user authentication and activity recognition. It describes collecting data from accelerometers, GPS, WiFi and applications to build language models of user behavior. Scores are calculated to determine the likelihood a behavior belongs to a user or activity class. Authentication is triggered based on thresholds. The system was tested to identify users from single key presses and detect anomalies with days of training data at 80% accuracy. Future work involves expanded data collection, improved models, integration with security frameworks, and ensuring user privacy.
SenSec: Mobile Application Security through Passive Sensing (Jiang Zhu)
The document proposes a smartphone-based behavioral authentication system called SenSec. It collects sensor data to build user behavior models. Features are extracted from the sensor data and used to build risk analysis trees to detect anomalies. When anomalies are detected, a certainty score is broadcast and can trigger authentication for sensitive applications. The system was tested on a dataset of 25 users, achieving over 98% accuracy in user identification. Extensions and integrations with other systems are discussed to enhance security, privacy, and energy efficiency.
TaintDroid is a system that provides dynamic taint tracking and analysis for Android. It tracks privacy sensitive information like location, contacts etc. at variable, message, method and file levels with 14% overhead. Testing 30 apps found 20 shared information unexpectedly, like sending device IDs or location to ad servers. TaintDroid effectively demonstrates the need for stronger mobile privacy but has limitations like requiring OS modifications and false positives. Future work aims to reduce false positives, integrate crowdsourcing and detect privacy information leakage attempts.
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series (Jiang Zhu)
Over the decades, we have seen tremendous success in biometrics technologies being used in all types of applications based on the physical attributes of the individual such as face, fingerprints, voice and iris. Inspired by this, we introduce a new concept Mobile Behaviometrics, which uses algorithms and models to measure and quantify unique human behavioral patterns in place of human bio-attributes. Behaviometrics algorithms take multiple data from various sensors as input and fuse them to build behavioral models which are capable of producing application specific quantitative analysis on the unique individuals that were the originators of the data.
The document discusses intuitive user interfaces and one-touch interactions. It describes a company called IntuitiveUI that aims to simplify device usage through predictive modeling and a one-touch experience. IntuitiveUI uses sensors and logging of user behaviors to build statistical models and predict common actions based on context like time, location and past events. This allows displaying relevant options with a single touch rather than multiple taps through conventional menus. The approach aims to overcome challenges of mobile complexity but challenges include uneven data collection and meeting user expectations.
Context is King: AR, AI, Salience, and the Constant Next Scenario (Clark Dodsworth)
Clark Dodsworth’s AREvent talk, Santa Clara, CA, June 3, 2010: "Context is King: AR, AI, Salience, and the Constant Next Scenario". Mostly about smartphone AR as a gateway to context-aware computing becoming indispensable.
The document discusses how networks and applications can become more aware of each other to improve the experience for end users. Currently, networks and applications operate independently without much visibility into each other. The document proposes that applications share information about end users and traffic with networks, and networks share information about topology, bandwidth, and resources with applications. This would allow applications to optimize content placement and resource usage, and networks to gain insights to better optimize traffic and provide new services. The document argues this type of programmable network can improve areas like security, performance, analytics and more.
The document summarizes Radhika Dharurkar's Masters thesis defense on context-aware middleware for activity recognition. It provides an overview of her motivation, approach, implementation, experiments and results. Her work involved developing a prototype system that can predict 10 activities using data from smartphone sensors and other sources with better than average precision. Experiments were conducted collecting data from 2 users over 2 weeks to evaluate different classification algorithms on recognizing activities like working, studying, sleeping, etc. The most confused activities in classification were working/studying with others like coffee/snacks and sleeping.
Mobile Oxford - Open Source Junction, 5 July 2011 (Tim Fernando)
This document is a case study about the Mobile Oxford project from the University of Oxford. It describes how Mobile Oxford was created as an open source and accessible mobile website to provide services to students, staff and visitors of the university. It aggregates data from various university systems and provides features like transport information, contacts, library search, and tools from their learning management system. Mobile Oxford is now developed as part of the open source Molly Project to ensure long term sustainability and benefit other universities.
1) The document discusses a context-aware mobile social web, where mobile applications can access and use contextual information about users, such as their location, activity, and device characteristics.
2) Telecom Italia has developed platforms and applications to enable this context-aware mobile social web, including tools for collecting, representing, and analyzing context data, as well as applications that provide location-based recommendations and allow users to tag content with context.
3) Lessons from deploying these applications include the importance of openness to popular social networks, using context accurately, promoting high-quality content, and ensuring user privacy and comfort with sharing personal information. Standardization of context representation could help address challenges involving context certification and sharing
Uncovering Remote Peering Interconnections at IXPs (APNIC)
This document discusses remote peering at internet exchange points (IXPs). It presents a new methodology to accurately infer which peers are connected to IXPs through remote peering. The methodology uses factors like port capacity, ping round-trip times, colocation facilities, connections to multiple IXPs, and private connectivity to determine if a peer is local or remote. Applying this methodology to top IXPs found that around a third of members peer remotely, and large IXPs have around 40% remote peers. Remote peering is becoming a more popular practice that allows IXPs to expand their geographical reach. A public portal was also proposed to provide monthly snapshots of remote versus local peering inferences.
From Context-awareness to Human Behavior Patterns (Ville Antila)
Ville Antila discusses using smartphones to detect daily routines and human behavior patterns through continuous context logging. Smartphones can sense context through built-in sensors and log location, device usage, physical activity, and Bluetooth snapshots. This data is interpreted to estimate routines like locations visited and detect changes. Example applications include context-adaptive feedback that considers situation suitability, and context-based user interface migration between devices. Challenges include ensuring quality, user awareness of adaptive behavior, and testing context-aware applications in real-world use.
An Architecture for Privacy-Sensitive Ubiquitous Computing at MobiSys 2004 (Jason Hong)
Some older research I did looking at one way of building privacy-sensitive apps for ubiquitous computing environments. The core idea is to focus on locality, where all of the data is sensed and processed locally as much as possible.
Privacy is the most often-cited criticism of ubiquitous computing, and may be the greatest barrier to its long-term success. However, developers currently have little support in designing software architectures and in creating interactions that are effective in helping end-users manage their privacy. To address this problem, we present Confab, a toolkit for facilitating the development of privacy-sensitive ubiquitous computing applications. The requirements for Confab were gathered through an analysis of privacy needs for both end-users and application developers. Confab provides basic support for building ubiquitous computing applications, providing a framework as well as several customizable privacy mechanisms. Confab also comes with extensions for managing location privacy. Combined, these features allow application developers and end-users to support a spectrum of trust levels and privacy needs.
Authors are Jason Hong and James Landay
This document discusses cyber-physical systems and the Internet of Things. It outlines Tata Consultancy Services' research programs in areas like mobile phone sensing, camera sensing, signal and image processing, and human activity detection using sensors. The goals are to develop an IoT platform for affordable healthcare and wellness solutions using mobile phones to detect physiological parameters. Research is also described on indoor localization, cognitive load detection using EEG, and emotion recognition using cameras. TCS has several innovation labs conducting exploratory research on mobile interactive remote sensing applications.
The document discusses OpenSense, a research project that aims to monitor air pollution through dense deployments of wireless sensor nodes. The goals are to gather more precise, location-dependent pollution data in real-time. This would help officials and citizens, but poses technical challenges around sensor coordination, data quality, and privacy. The researchers use a utility-based control approach to optimize measurements across different system layers and models. OpenSense has deployed sensor nodes on buses, trams, and stationary wireless installations in several Swiss cities.
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates (Ville Antila)
Presentation of an experimental mobile application, which allows users to add different descriptions of context information to their Facebook status updates. The meaningfulness and the usage of different context descriptions were evaluated in a two-week user trial. The results show that the most frequently used awareness cues in the test setting were location, surroundings, friends and activity. The results also indicate that user-defined semantic abstractions of context items (e.g. “home”, “work”) were often more informative and useful than more accurate indicators (e.g. the address or the name of the place). We also found out that using shared context from friends in vicinity (e.g. identifying the people around) needs careful design to overcome the extended privacy implications.
Mobile Oxford - Open Source Junction, 29 March 2011 (Tim Fernando)
This document discusses Mobile Oxford, a mobile website developed at the University of Oxford to aggregate content and services for students, staff, and visitors onto multiple mobile devices. It was developed as an open source project called Molly to ensure long-term sustainability and accessibility. Mobile Oxford has become a central hub providing places, transport, contacts, library search, and tools from the Weblearn learning management system. It is developed entirely as open source software to allow customization and benefits from ongoing community contributions.
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ, at MLconf ATL 2016 (MLconf)
Discerning Human Behavior from Mobility Data: Mobility data encompasses many elements, including location history, latitude coordinates, longitude coordinates, anonymized mobile device IDs, and timestamps. Such data are generated, for instance, by automobile navigation applications and by the mobile advertising ecosystem. Typical sources of mobility data contain extensive inaccuracies that result from a variety of sources, ranging from shortcomings in location services on mobile devices to the intentional misrepresentation of spatial coordinates by bad ecosystem actors. In this talk, we describe a production data pipeline, Darwin, which analyzes the location quality of mobility data to measure how accurately a set of mobility data represents true movement patterns. Darwin uses a number of measures that are ultimately combined into two quality scores: hyper-locality and clusterability. These measurements include techniques from information theory, the mean number of spatial clusters, the compactness of the clusters, and the differences between the empirical distribution of digits in the spatial coordinates and reference distributions.
Network Driven Behaviour Modelling for Designing User Centred IoT Services (Fahim Kawsar)
We are observing a monumental effort from the industry and academia to make everything connected. Naturally, to understand the needs of these connected things, we need a better understanding of humans and where, when, and how they interact. Then we can create digital services and capabilities that fundamentally change the way we experience our lives. IoT 1.0 is all about connectivity, and scale. IoT 2.0 will be about learning and contextual automation. Designing intention- and behavior-aware services will be the principal source of differentiation, and competitive advantage for the industry players. In this talk I argue that for wide scale adoption, and market penetration of personalized IoT services, existing network infrastructure should play the key role for sensing and learning, by eliminating the cost of deployment and management of many sensors. I will show then how wireless network can be used as a sensing platform to model human behaviour and to redefine people-content, people-thing, and people-people interaction experience in an IoT enabled world.
This document provides an agenda for the MOBISYS seminar, listing the speakers and their topics. It includes snippets of conversations between the speakers as they introduce themselves and their topics. The document discusses research related to pervasive computing, wireless networks, and mobile systems.
This document provides an agenda for the MOBISYS seminar, listing the speakers and their topics. It includes snippets of conversations between the speakers as they introduce themselves and their topics. The document discusses research related to pervasive computing, wireless networks, and mobile systems.
Sense Networks uses proprietary technology and location analytics expertise to analyze location history and behavior patterns to deliver unique insights and intelligence. They have built a high-capacity platform called MacroSense that can extract information from tens of millions of user locations and points of interest, segmenting users and predicting future behaviors. Their location-based segments have been shown to drive user responses and actions.
Digital Aura allows Alice and Bob to connect via their Bluetooth devices when Bob recognizes Alice's request for help with algorithms from her profile as he passes by. His phone alerts him and transfers Alice's contact information so they can get in touch.
1. Jiang Zhu and Joy Y. Zhang
Carnegie Mellon University
August 2nd, 2011
2. • Monitor and track user mobility behavior in a WLAN environment using RSS traces
• Convert mobility traces and other context information to Behavior Text representations
• Build an n-gram language model from the behavior text and use it for anomaly detection to discover loss or theft events
4. [Chart: mobile device loss or theft rates (0-60%) across frequently visited U.S. cities, including Miami, New York, Los Angeles, Phoenix, Sacramento, Chicago, Dallas, Houston, Philadelphia, Boston and San Francisco]
Strategy One survey conducted among a U.S. sample of 3,017 adults age 18 years or older, September 21-28, 2010, with an oversample in the top 20 cities (based on population).
5. Business and personal applications running together
Corporate messaging and email on personal devices
Intranet wireless access on personal devices
Personal finance and banking on corporate devices
Mobile payments and credentials
• CAPEX loss
• Data loss
• Recovery effort
• Loss of business
"The 329 organizations polled had collectively lost more than 86,000 devices ... with average cost of lost data at $49,246 per device, worth $2.1 billion or $6.4 million per organization."
"The Billion Dollar Lost-Laptop Study," conducted by Intel Corporation and the Ponemon Institute, analyzed the scope and circumstances of missing laptop PCs.
6. Detection: discover the loss or theft early enough to initiate other steps
Mitigation: revoke access to sensitive data, applications or services
Notification: notify owners, administrators or authorities
Recovery: rescue the device; recover/restore data
7. • Mobility as behavior
• Mobility modeling is a well-studied research area
• Can be measured and tracked: Wi-Fi, GPS, cellular, etc.
• Other contextual information can be combined: Bluetooth, accelerometer, etc.
• Other motivating applications
• Healthcare: inpatient telemetry
• Education: monitoring young children
• Law enforcement: inmate monitoring and control
9. • Past and current locations trigger future locations
[Diagram: example transitions among locations: Hallway A, Hallway B, Office, Break Room, Bathroom]
• User mobility as a short sequence of locations [1] [2]
• "Language as action": language vs. streams of sensor data
• Composing elements: sensor data vs. words in a corpus
• Sequence structure: local dependency vs. "grammar"
[1] Aipperspach et al., "Modeling Human Behavior from Simple Sensors in the Home", PerCom 2006
[2] Buthpitiya et al., "n-gram Geo-Trace Modeling", Pervasive 2011
10. • User location at time t depends only on the last n-1 locations
• A sequence of locations can be predicted from the n consecutive locations in the past
• Maximum Likelihood Estimation from training data by counting
• MLE assigns zero probability to unseen n-grams
• Incorporate a smoothing function (Katz): discount probability mass from observed grams and reserve it for unseen grams
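To make the counting and smoothing step concrete, here is a minimal Python sketch of an n-gram model over pseudo-location labels. The absolute-discount fallback is a simplified stand-in for the Katz back-off the slide mentions, and all names and values are illustrative rather than taken from the authors' implementation.

```python
from collections import Counter

def train_ngram(labels, n=3):
    """Count n-grams and their (n-1)-gram contexts in a pseudo-location sequence."""
    ngrams = Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))
    contexts = Counter(tuple(labels[i:i + n - 1]) for i in range(len(labels) - n + 1))
    return ngrams, contexts, set(labels)

def prob(ngrams, contexts, vocab, context, loc, discount=0.5):
    """P(loc | context): discounted MLE counts; the reserved mass is spread
    uniformly over locations never seen after this context (a simplified
    stand-in for Katz smoothing)."""
    ctx_count = contexts.get(context, 0)
    if ctx_count == 0:                           # unseen context: fall back to uniform
        return 1.0 / len(vocab)
    seen = {l for l in vocab if (context + (l,)) in ngrams}
    count = ngrams.get(context + (loc,), 0)
    if count > 0:
        return (count - discount) / ctx_count    # discounted MLE for observed grams
    reserved = discount * len(seen) / ctx_count  # mass taken from observed grams
    return reserved / max(1, len(vocab) - len(seen))

# Toy usage on a short pseudo-location trace
trace = ["L1", "L2", "L3", "L1", "L2", "L3", "L1", "L4"]
ngrams, contexts, vocab = train_ngram(trace, n=3)
print(prob(ngrams, contexts, vocab, ("L1", "L2"), "L3"))  # frequent transition
print(prob(ngrams, contexts, vocab, ("L1", "L2"), "L4"))  # unseen, gets reserved mass
```

With the discount the conditional probabilities for a given context still sum to one, which is what the later log-probability scoring relies on.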
11. • Long-distance dependency of words in sentences
• Trigrams for "I hit the tennis ball": "I hit the", "hit the tennis", "the tennis ball"
• "I hit ball" is not captured
• A future pseudo location may depend on locations far in the past; intermediate behavior has little relevance or influence
• Noise in the collected data: "ping-pong" effect in WLAN association, interference, sampling errors, etc.
• Model size
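Skipped n-grams are one way to capture such non-adjacent dependencies. The following is a hedged sketch of a k-skip-n-gram extractor (not the authors' code); applied to the example sentence it produces "I hit ball" alongside the contiguous trigrams.

```python
from itertools import combinations

def skip_ngrams(tokens, n=3, k=2):
    """k-skip-n-grams: n tokens kept in order, with at most k positions
    skipped inside the span covered by the gram."""
    grams = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n + k]      # widest span a gram may cover
        for rest in combinations(range(1, len(window)), n - 1):
            grams.append((window[0],) + tuple(window[i] for i in rest))
    return grams

print(skip_ngrams("I hit the tennis ball".split()))
# Contains ('I', 'hit', 'ball') in addition to the contiguous trigrams.
```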
13. • Collect RSS of the devices on multiple WAPs with timestamps
• Aggregate and serialize into a time series of RSS vectors
* Lin et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
14. • Dimensionality of the RSS vector is too fine for modeling
• Proximity in location results in similar RSS vectors
• K-means clustering with a distance function similar to WASP [1]; each cluster is assigned a pseudo-location label
[1] Lin et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
15. • Repeating location labels dominate the n-gram statistics
• Extract "duration" by counting repeating labels
• Only append the "duration" label if the mutual information between location and duration is high
• Dependency: "Conference Room" + "1 hour" infers "Meeting"
• Personal: "Professor's Office" + "10 minutes" infers "Student's quick chat"
• Segment behavior text sequences based on time-of-day
• Behavior follows routine and agenda
• Varies among users
• Cut the boundary based on activity level
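A hedged sketch of the collapsing step under the 13-second sampling assumption: repeated labels are run-length encoded into (location, duration) pairs, durations are bucketed, and mutual information between location and duration bucket is computed. The bucket edges and the MI check are illustrative choices, not the authors' exact thresholds.

```python
from collections import Counter
from itertools import groupby
from math import log2

def collapse_runs(labels, sample_seconds=13):
    """Run-length encode repeated pseudo-location labels into (location, seconds) pairs."""
    return [(loc, len(list(grp)) * sample_seconds) for loc, grp in groupby(labels)]

def duration_bucket(seconds):
    """Map a dwell time onto a coarse duration token (bucket edges are illustrative)."""
    for limit, tag in [(60, "<1min"), (600, "<10min"), (3600, "<1hr")]:
        if seconds < limit:
            return tag
    return ">=1hr"

def mutual_information(pairs):
    """Mutual information between location and duration bucket over observed pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    locs = Counter(l for l, _ in pairs)
    buckets = Counter(b for _, b in pairs)
    return sum(c / n * log2((c / n) / ((locs[l] / n) * (buckets[b] / n)))
               for (l, b), c in joint.items())

labels = ["Office"] * 300 + ["HallwayA"] * 4 + ["ConfRoom"] * 280 + ["HallwayA"] * 3
tokens = [(loc, duration_bucket(sec)) for loc, sec in collapse_runs(labels)]
print(tokens)
print(mutual_information(tokens))  # high MI argues for keeping the duration token
```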
16. [System diagram: the RSS trace and other sensing feed feature extraction (pseudo-location extraction plus other features); preprocessing fuses them into behavior text; an n-gram model over the behavior text drives anomaly detection, outputting Anomaly Y/N]
17. • Feed the sequence of past locations in a sliding window of size N to the n-gram model for testing
• For a testing sequence of pseudo locations:
• Estimate the average log probability that this sequence is generated by the n-gram or skipped n-gram model
• If this likelihood drops below a threshold, flag an anomaly alert (a scoring sketch follows this list)
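A minimal sketch of the scoring step. The `model.logprob(token, history)` call is a hypothetical interface to the trained, smoothed n-gram model (not an actual API from this work); the score is the average log probability of the last N behavior-text tokens, and an alert fires when it falls below the chosen threshold.

```python
def window_scores(tokens, model, n=5, window=20):
    """Average log probability of each sliding window under the n-gram model."""
    scores = []
    for end in range(window, len(tokens) + 1):
        chunk = tokens[end - window:end]
        logp = 0.0
        for i, tok in enumerate(chunk):
            history = tuple(chunk[max(0, i - n + 1):i])   # last n-1 tokens
            logp += model.logprob(tok, history)            # assumed model API
        scores.append(logp / window)
    return scores

def flag_anomalies(scores, threshold):
    """Indices of windows whose likelihood drops below the threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]
```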
17
18. [Figure: average log probability vs. sliding-window position, with low and high detection thresholds; point A is the actual anomaly, B illustrates detection delay, and C/D illustrate false positives]
18
19. System pipeline: Sensing (RSS trace, other sensing) → Preprocessing (extract pseudo locations, extract other features, fusion, behavior text generation) → Anomaly Detection (n-gram model, threshold) → Anomaly Y/N
19
21. Dataset
Users: 40
Location: Cisco SJC 14 1F, Alpha networks
RSS sampling rate: 13 sec
Period: 5 days
Number of WAPs: 87
Device: Cisco Aironet 1500 + MSE
Dataset size: 3.2 mil points
• RSS vector clustering: run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids
• K = 3X #WAPs has the best trade-offs
• Yields ~260 pseudo locations
21
22. • Testing samples
Positive sample: simulated anomaly created by splicing traces from two different users (a splicing sketch follows this list)
Negative sample: trace from the "owner"
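A sketch of how a positive (simulated-theft) sample could be spliced together; picking the first label the two users share as the splice point is my own simplification of the "intersection point" idea from the notes.

```python
def splice_traces(owner_trace, other_trace):
    """Simulate a stolen-device event: keep the owner's behavior text up to
    the first label both users share, then continue with the other user's
    trace from that point on. Returns None if the traces never intersect."""
    shared = set(owner_trace) & set(other_trace)
    for i, label in enumerate(owner_trace):
        if label in shared:
            j = other_trace.index(label)
            return owner_trace[:i + 1] + other_trace[j + 1:]
    return None
```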
22
23. • Train n-gram models with 8 hours of data
• A continuous 5-gram model and a skipped 3-gram with skipping factor k=2 result in similar accuracy, ~60%
• Model complexity: k-order reduction
• Skip factor K is data dependent; it suits the particular scenario in our data set: an office floor with hallways and corridors
• Further investigation is needed to find the optimal K
• Replacing repeating labels with the duration feature improves the model
Before collapsing, 5-gram statistics are dominated by several sequences with long repeating locations; the top 200 grams are repeating labels
After collapsing, 5-gram statistics are well distributed
• The time-of-day feature brings only marginal improvement, <1%
23
24. [Figure: ROC curves (true positive rate vs. false positive rate) for training data sizes of 8 and 12 hours]
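A sketch of reading a detection threshold off the ROC data, assuming per-window scores and ground-truth anomaly labels as above; scikit-learn's roc_curve is used purely for illustration (lower scores mean "more anomalous", hence the negation).

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, scores, max_fpr=0.1):
    """y_true: 1 for anomalous windows, 0 for normal ones.
    Returns (score_threshold, tpr, fpr): flag windows whose score falls
    below score_threshold, chosen as the best TPR within the FPR budget."""
    fpr, tpr, thresholds = roc_curve(y_true, -np.asarray(scores))
    ok = fpr <= max_fpr
    best = int(np.argmax(tpr[ok]))
    return -thresholds[ok][best], tpr[ok][best], fpr[ok][best]
```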
24
25. [Figure: detection accuracy vs. n-gram order (1-10) for training data sizes of 4, 8 and 12 hours]
25
27. System pipeline: Sensing (RSS trace, other sensing) → Preprocessing (extract pseudo locations, extract other features, fusion, behavior text generation) → Anomaly Detection (n-gram model, threshold) → Anomaly Y/N
• Experiments to discover loss or theft events through anomaly detection reach 70~80% accuracy with only 8 hours of training data
27
28. Thank you.
And special thanks to our sponsors
CyLab Mobility Research Center
Cisco Systems Inc.
Army Research Office
33. • Extract a mobility model from real traces in a WLAN environment [1]
• Extract mobility tracks and duration from WLAN association records
• Analyze mobility characteristics: pause time, speed, direction, destination region and their distributions
• Build an empirical model to generate synthetic traces
• Steady-state and transient behavior can be modeled with a Semi-Markov model [2]
Transition probability matrix and sojourn time distribution
• Language model to model behavior from sensors in the home [3]
Shows support for the similarity between language and behavior
Smoothed n-gram model to make single-step predictions on binary sensor readings from a smart home
[1] Kim et al, "Extracting a Mobility Model from Real User Traces", INFOCOM 2006
[2] Lee and Hou, "Modeling Steady-State and Transient Behaviors of User Mobility", MobiHoc 2006
[3] Aipperspach, et al, "Modeling Human Behavior from Simple Sensors in the Home", PerCom 2006
33
34. • Overhead and lack of granularity in inferring user location and pause time from WLAN association records [Kim'06] → fine-grain, higher-dimension trace data to model mobility behavior, such as RSS beacon traces
• Model complexity and computational overhead not suitable for real-time applications [Lee'06] → simple and cost-effective model to capture mobility, reducing ping-pong effects
• It is straightforward to convert binary sensor data to behavior text for LM-based analysis [Aipp'06], but heterogeneous multi-valued sensory data is hard to convert to a single-dimension behavior text
34
36. • Calculate coordinates for each RSS vector using the "indoor location" algorithm [1] and generate a hot-region plot
[1] Lin, et al, "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
36
37. • Select the 10 users with the lowest cross entropy (a sketch follows below)
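A sketch of the cross-entropy criterion over unigram pseudo-location distributions, which is how I read the speaker notes; the additive smoothing constant is an assumption to avoid log(0).

```python
import math
from collections import Counter

def cross_entropy(trace_p, trace_q, vocab, eps=1e-6):
    """H(P, Q) over pseudo-location labels: how surprising user Q's label
    distribution makes user P's trace. Lower values indicate users whose
    mobility areas overlap more."""
    p, q = Counter(trace_p), Counter(trace_q)
    n_p, n_q = len(trace_p), len(trace_q)
    h = 0.0
    for label in vocab:
        p_prob = p[label] / n_p
        q_prob = (q[label] + eps) / (n_q + eps * len(vocab))  # smoothed
        if p_prob > 0:
            h -= p_prob * math.log(q_prob)
    return h
```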
37
38. • Help Cisco adopt this model in the Mobility Services Engine
• Heterogeneous sensor data fusion
Network traffic patterns from wireless controllers
Applications, Memory and battery status
GPS, accelerometers, gyroscope, temperature, etc
• Advanced Model
Leverage the internal factorized relationships among various sensors
• Factor Language Model
• More Applications
Prediction: resource allocation, energy saving, personalized services
Anomaly detection: adaptive authentication, patient telemetry
38
40. • Confirm similarity between language and behavior
• Multi-dimension to single dimension and n-gram: low complexity
but good results
• Potential problems:
• Dimensionality reduction to 1-D to use the language approach in modeling may cause loss of the relationships among multi-dimensional data
[Diagram: Sensor 1 and Sensor 2 jointly determining a State]
• The skipped n-gram approach is data dependent and may bring only marginal improvement or even worse results
40
Editor's Notes
Good afternoon everybody.. Thank you for coming. Today.. I’ll be presenting my work on a language approach for detecting anomalies in user mobility behaviors by modeling their WiFi traces
As a quick overview of our work, in order to do anomaly detection, we monitor and track user mobility behavior through the RSS trace from the WiFi environment. We then convert these traces and other context information to a behavior text representation. After that we build an n-gram language model and use it to discover anomalies such as device loss or theft.
So why we want to study the anomaly detection in such an environment… let me talk about our motivation.So, as we all know, Mobile applications and devices are becoming ubiquitous. On one side, mobile devices make our lives convenient. And people love it. But on the other side, the broad adoption of mobile applications such as email, messaging.. online banking and personal finance expose our identities and privacy to greater risk. The devices are portable and can be used almost everywhere we go, therefore they are also easy to lose or be stolen.
Last year, a survey showed that on average 36% of the people who participated have experienced device loss or theft in the past. Among the regions surveyed, Miami and New York have loss rates as high as 50%. It also shows that a big portion of the losses happened at places we often visit, such as university campuses and office buildings.
Losing a mobile device nowadays is not the same as 10, 20 years ago. With the proliferation of mobile devices in corporate environment, the boundary of personal devices and business devices are so blurry. People are using the same devices to gain intranet wireless access, to check corporate emails, to work on business documents, even to access trade secrets. If the devices are lost, there would be greater risks in term of data loss.Another survey shows that the data loss cost is about 50 thousand dollars per device or 6.4 million dollars per organization in the past.
Given the high device loss rate and high cost associated with these losses, accountable schemes are needed to promptly and accurately discover and detect these undesirable events. Such detections will facilitate subsequent notification, mitigation and recovery process to control or even avoid the damages. And in our research work, we are focusing on the detection part of the whole action chains, namely “Anomaly Detection” First we collect user behavior and build an accountable behavior model. We can monitoruser behavior constantly and compare it with the learned model. if it deviates from the learned model, we can flag an alert.
Behavior is a broad concept. Here, we want to leverage mobility as behavior as the example just shown. The reason is the following , first, mobility modeling has been studied thoroughly in the past and there are a lot of methodology that we can borrow from and lessions we can learn from. Secondly, mobility can be easily measured in the current computing environment, WIFi, GPS, cellular and can be combined with other context information such as bluetooth and other sensors. …Although our focus is on the detection of mobile device loss and theft, there are a lot of other motivating applications of mobility anomaly detection. One interesting example among other listed here is inpatient monitoring or telemetry. Imagine if we can detect anomaly in inpatient’s mobility in a hospital, medical help can be called up to handle the situation promptly.
What we would like our system to do is to ..Sense the WiFI signals of the mobile devices …. Do some preprocessing on the data …. And then feed it into an anomaly detection model which then outputs whether there is an anomaly or not let’s take a look at the assumptions on which our approach is based on.
Our system is based on the assumption that a user will have a unique set of locations which act as triggers for their future locations. For example, an employee exiting the break room may have two destinations: hallway A and hallway B. If we know he is taking hallway A, <<hit enter>> we know that he will be in his office soon. Otherwise, <<hit enter>> he may go to the bathroom instead. Previous work showed that the user mobility model can be estimated by short sequences of locations … and showed a correlation between human behavior and natural language. … Research also showed that language models can be used to effectively detect anomalies in geo-traces.
So, building along this line, we use a continuous n-gram model to learn the sequence of locations from a user's WiFi traces. The n-gram model works under the assumption that the next location in the sequence depends on just the last n-1 locations. Once the n-gram model is trained, we can use it to calculate the probability of all possible next locations given the past n-1 locations, and see which one is the most likely location. To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probabilities, just by counting. As shown in the equation, the MLE probability of being in a location at time i conditioned on the past n-1 history locations is just the count of all such n-sequences in the data divided by the count of all these (n-1)-sequences. There is one small problem with this approach. Let's say our model comes across a location that has not been seen in training: it just assumes a zero probability. This may push the system to trigger an anomaly alert. Luckily, the n-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz take some probability mass from the seen locations and reserve it for those unseen locations.
In natural language, words in a sentence may have long-distance dependencies. For example, the sentence “I hit the tennis ball” … has 3 tri-grams.. “I hit the” … “hit the tennis” .. And.. “the tennis ball” It is clear that an equally important tri-gram “I hit ball” is not normally captured by the continuous n-gram… because the separators ‘the” “tennis” is in the middle. If we could skip the separators … and we can form this important tri-gram. I hit ball Similarity, in our continuous n-gram model I just described, user’s next locations is dependent only on his n-1 previous locations. However, in many cases this may not be true.Use the same example, if a user is leaving the break room and entering hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway and before entering the office are not that important. Those locations can be skipped in the modeling. As shown in the diagram here, ABC is the break room, ACD is the entrance of the hallway and EDB is the office. Anything in the middle can be skipped and still give the same results. By skipping detracting grams, now… the effective n-gram order becomes (n-d). Therefore, we can reduce the size of the model in terms of computation and storage because the n-gram model has better performance for a lower value of n.
Now we have talked about our language-based model on the right-hand side. But we can't feed the WiFi traces to the n-gram model directly. Firstly, n-gram models can't handle numeric data like signal strength; they can only take discrete sets of symbols. The second issue is that, even though we represent the RSS trace as vectors, the amount of data required to create a model with reasonable accuracy would be immense, because it is not likely there will be repeating signal strengths with exactly the same readings. Therefore, we need to take a look at our data and find a way to convert the sensed data into a text representation.
The Wifi trace we collect in our system is different from the Dartmouth data set. The management, control and data frames from a device will be heard by multiple APs. In our particular setup, these APs will record the Received signal strength or RSS of those frame along with the Identity of the device and timing information.These traces will be aggregated to a central location .. where we can serialize these traces based on the time stamp and classify them using the device IDs. So.. for a particular device, we can build a time series of RSS vector, each element in the vector is the RSS from a particular AP. These series of RSS vector along with other context information serves as the input to the preprocessing module…. Where we will convert these to a text representation before feed them into our n-gram model.
From the signal propagation model, if two vectors are very similar, we know that the location where this vectors are measured should be within a reasonable proximity. Based on this assumption, we want to partition the RSS vector space into many “pseudo locations” and assign each “pseudo location” a unique label. By pseudo, we mean we don’t need to know the exact location of the reading, we just need to distinguish between two different locationsWell, this can be easily done by clustering algorithm… for example K-means clustering. In the k-mean clustering runs, we use a distance function similar to redpin and WASP in addition to the standard cosine function to reduce the noise caused by interference.Once the clustering is done, we assign labels to all the members belong to the same cluster….
We also incorporated other features. Due to the way the data is collected and aggregated, there could be a lot of repeating labels in the sequences if a user stays at one location for a long time. To extract one more "duration" feature, we count the repeating labels, remove the repeating sequence and add a new label with both location and duration information. One minor improvement we made is to only append the duration label if the mutual information between the location and duration is high. Intuitively, we want to capture the correlations between the location and the duration. For example, conference room + 1 hour will imply a meeting, while office + 10 min will imply a quick visit. The time-of-day feature is also quantized into 4 labels and appended to the main pseudo location label. The quantization is not based on a fixed boundary, because we know that a user's mobility also follows certain regularities due to job roles and responsibilities, and sometimes it follows a personalized agenda. We choose the boundary for time of day based on the user's activity level. <<next slide>> Mutual information: $I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{p_1(x)\,p_2(y)} \,dx\,dy$; $I(X;Y) = 0$ implies independence, and $I(X;Y) \ge 0$.
Now we have the Sensing, Preprocessing and Modeling parts in place, let’s take a look how this system is used to do anomaly detection
We feed the RSS trace to the preprocessing module and then feed it to the n-gram model.. And the n-gram model continuously produces the likelihood estimate for the last N behavior text,… specifically, we will calculate the average log probability of this N behavior text using this equation If this likelihood drops below a certain threshold, the system will trigger an anomaly alert.
This graph shows the anomaly detection process and demonstrate different threshold may cause either detection delay (B) or cause false positives (point C & D) when point A is the actual anomaly point. The way to find the right threshold is to use receiver-operating-characteristic curve or ROC curve. We will look at this in more details later in the talk.
So, this complete the whole system architecture. We have the sensing part that produce RSS traces, we have preprocessing part that convert the traces and other context information to behavior text and we have the modeling training and inference part that is used to do anomaly detection with a design parameter “threshold”
Now, let’s discuss the experiments we did.Before looking at the experiments and results, let me describe the data set we used.
So… we collected the RSS traces from 87 WAPs in an office building over 5 days. The time precision of the RSS samples is at the 13-second level. These traces contain complete data for 40 users, and in total we have about 3.2 million data points. To determine the number of clusters in the k-means clustering, we took a small subset of the traces and ran the algorithm with different Ks. We evaluated the results by looking at the average distance to centroids and the number of iterations. If we choose K as the number of APs, it will be similar to using association records. If K is too large, the clustering algorithm will take long to finish and the resulting n-gram model will have a large vocabulary size. We found that picking K as 3 times the number of APs provides reasonable clustering performance and quality compared to 4 times or 5 times. This resulted in about 260 pseudo location labels. Backup data points: pseudo locations from RSS (other schemes not very …); about 1500 RSS data points per user on average, with RSS from 3-7 WAPs; assuming a user is up half of the time -> 80k data points per user for 5 days; 3.2 million data points collected for 40 users; 20 million RSS readings; for each of these 40 users, 16K RSS vectors in total.
To validate our system, we need to have some testing data. However, from the trace we collected, there are no recorded anomaly fortunately. We created simulated device stolen events by splicing two users’ trace segments at their intersection points…. where similar label or labels sequences are shared. We combined this simulated traces with normal traces to create a testing data set.
Before we run experiments to explore the design parameter space such as threshold, n-gram order n and training size, we want to gain some insights on whether the model works and whether the ideas in preprocessing ,, we described.. have some impacts. First, we want to how skipped n-gram affect our model. Using 8 hours of data, we train a continuous 5-gram model and skip-2 5-ngram model. Both model can capture similar length of mobility behavior and with similar detection accuracy. But the skip n-gram model has k-order reduction in the model size. This particular scenario works is probably due to the environment where the data is collected. The office floor has hallways and corridors and people have to follow those to walk around. We also found that … removing the repeating labels and adding the duration features help in the model. The 5-gram model was dominated by these repeating labels. Actually top 200 grams are repeating or partially repeating grams. After we enable the duration feature, the 5-grams statistics are better distributed. Lastly, we found the time-of-day feature doesn’t provide much gain as it brings about less than 1% improvement. This is probably due to the length of the training data. 8 hours training may not be able to capture the daily routine that well, so… time-of-day feature doesn’t have significant effect on the results.
Now we gained some insights on our approach. It is time to explore some of the design parameters we mentioned in the beginning. The first set of experiments is to find the best anomaly detection threshold. Actually there is no best threshold, the threshold is depending on the applications we are running. What’s the requirements on the detection accuracy? Can we allow much false positive? Do we have enough training data? To provide a guideline in answering these questions, we plot Receiver Operating Characteristic curve (or ROC curve) Essentially, ROC curve is about the trade-offs between the true-positive rate and false-positive rate in our anomaly detection. We perform the experiments with different training data sizes. We plot the ROC curve by varying the threshold and record the TPR and FPRWith the ROC curve, we can decide the threshold for a particular application depending on The amount of data the model should see before the model can detect anomaly The required TPR Or the acceptable FPRFor example, we want to use 8 hour training size and want to have less than 0.1 false positive rate, then we just need to locate this point and obtain the threshold by which this data point is generated. (0.4) We need to use threshold < 0.4 in order to fulfill the FPR requirement. Another example: let’s say we want to have the same FPR requirement but want to have TPR > 0.8, then we have to use more than 8 hours training size to archive this goal.
We plot this graphs with different training size and n-gram orders. From the graph, we can see several things. A higher order model captures more context and in turn increase accuracy. But…. , accuracy saturates beyond 5, which means in user’s behavior is more likely to be dependent on its last 5 pseudo locations. This resonates with the past work we mentioned in the beginning. It also tells us that increase the model complexity beyond this point will NOT bring about significant improvement.Second, it shows that if the training size is as small as 4 hours, it may not capture users’ mobility behavior thoroughly enough to make an accurate detection. Also, the closeness between 8 hr and 12 hour curves also suggests that our system will provide relative good results if we have observed users’ behavior for 8 hours. One interesting point to make here is the 12 hour and 8 hour curve cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation is leaning towards that the bigger training data set will exposure more common locations that are not captured in the shorter training size. With these common locations, people are sharing a lot of shorter sequences, leading to more simulated anomaly are not detected and … bring down the accuracy.
So now lets see what we conclude from this work and the future work we plan to do
In conclusion, we have built a system that monitors and tracks user mobility behavior through the RSS trace from the WLAN environment. We convert these traces and other context information to a behavior text representation, and we build an n-gram language model and use it to discover anomalies such as device loss or theft.
Finally, I would like to thank our sponsors from Cylab, Cisco and Army ResearchAnd Thank you all very much for your attention.
Think of a simple example, where the red traces on this office floor represent the usual mobility of a user. In this case, the user is finishing a meeting in a conference room and is going back to his cubicle. <<hit enter>> Now, if we look at another path the user is taking, instead of going this way, he is going in the other direction, <<hit enter>> then deviating further and further like this. In such a case, we would want to flag this as an anomaly. It could be the case that a visitor who attended the meeting took the device the employee forgot in the conference room and went away. The device may still have access to the company's internal network and other data sources; by receiving this alert, the infrastructure could revoke its authentication credentials temporarily until the user can authenticate himself again. <<hit enter>> Now, if instead of going further away, he is going back to his cubicle, just by taking an alternate path, we probably do not want to flag this as an anomaly.
As I just mentioned … mobility modeling is a well studied research area. Before we go into talking about our model, let me talk about some related work.
Mobility models have been heavily used in networking research, especially in ad hoc networks. Popular models such as random waypoints are derived from mathematical simplifications. Work by a group at Dartmouth College is among the first attempts to construct a WiFi mobility model from real-world traces. The trace data is basically the association records collected from the WiFi environment. Because the association records may not reflect a user's actual location, they developed methods and heuristics to extract mobility tracks and pause times. They derived distributions for pause time, speed, direction of travel and destination region, and used these to build an empirical model to generate synthetic traces. There are other works modeling mobility using Markov models. However, research showed that in real traces pause time doesn't follow an exponential distribution, so a Markov model may not be realistic if the pause duration follows other distributions. Another group at UIUC used the same data set and adopted a semi-Markov model to study steady-state and transient behavior. They constructed a transition probability matrix and sojourn time distributions, and built a time-location prediction algorithm to handle load balancing in WiFi networks. Another work, using Georgia Tech's smart home data set, caught our attention. In that work, the authors use a simple smoothed n-gram model to make single-step predictions on binary sensor readings. It further supports the similarity between language and human behavior. It actually inspired us to look at solutions for mobility modeling using a language approach.
All this existing work motivate us to think more on how to build a simple and effective mobility model to capture human behaviors. First, WiFi association records is one level indirection from the user mobility. We would like to have more direct sensor readings to reflect user mobility tracks. Secondly, semi-markov model or even DBN models are too complex for real time application. For the anomaly detection application that we are interested in, we need to come up with a simpler approach in order to have real time performance. Lastly, language and n-gram approach seems very promising on the simplicity side, however, converting mobility traces. Mostly multi-valued data streams, to a single demension text representation is very challenging. It is even more challenging if we want add other context information to it. With these findings and thoughts in mind, let me start to describe our approach. <<hit enter>>
Since we are reusing other user’s trace for testing, there is a problem that could lead to unfair evaluation. If the users that we used to splice the traces … have very different mobility regions, it should be very easy to detect the simulated anomaly… because their uni-gram statistics are so different. We would like to evaluate the systems using the testing data sets that are generated from users who share mobility behaviors. First, we want to see if user’s mobility areas are separable. We run “indoor location” algorithm and calculate the (x,y) coordinates. This gave us a chance to visualize the mobility patterns and coverage area.As shown in this particular graph, orange and green users are completely separated… and the red and blue have some overlap, but still partitioned. We need to remove user pairs like this in our simulated anomaly generation process.
Of course, we can NOT run the locationing algorithm for all our traces. We want to filter out those users at the pseudo location label level. …Cross entropy provides a way to measure the correlations of two distributions, and it is a good fit for our problem.We calculate the cross entropy of pseudo location labels for all the 40 users …. And we chose the 10 users with least cross entropy. This is to ensure these users mobility paths strongly overlap and it will provide fair evaluation with the simulated anomaly.
For future work, as part of the sponsored research, we will help Cisco integrate this model into their MSE as a value-added mobility application. This model will work with the existing CCX solution to help with enterprise device security, as well as leveraging its prediction capability to improve VoIP roaming performance. We are also looking into obtaining more heterogeneous sensor data from the current system, such as traffic patterns, device capabilities, and other external sensors such as GPS and temperature, to build a more robust sensor fusion framework. As mentioned in the previous slides, to handle the factor relationships among different sensors, we plan to adopt a factor language model. Last but not least, we are looking for opportunities to apply this work to more appealing applications in healthcare and in security.
One big message from this work is that we confirm the similarity between language and behavior again. N-gram model is simple and versatile enough for various applications. We demonstrated that we can combine multi-dimension data into a single dimension and convert to behavior text. We also demonstrated some of our ideas in preprocessing, modeling and testing led to reasonable improvements. Through experiments, we explored the parameters space and gained valuable insights. We also discovered some potential problems with these ideas. Especially with the dimensionality reduction. If the sensors have internal relationship and different factors towards the behavior modeling, reducing them blindly to 1-D may actually lose that information. Also, the skipped n-gram model is dependent on the data and needs further investigation.