
BehavioMetrics: A Big Data Approach


The spread of mobile devices equipped with embedded sensors makes it possible to capture the physical and virtual context of the user and the surrounding environment. Modeling human behavior from these data has become increasingly important with the growing popularity of context-aware computing and people-centric applications, which use users' behavior patterns to improve existing services or enable new ones. In many natural settings, however, broader application is hindered by three main challenges: the rarity of labels, uncertainty about activity granularity, and the difficulty of multi-dimensional sensor fusion.

Published in: Technology


  1. Jiang Zhu, jiang.zhu@sv.cmu.edu, December 13th, 2012
  2. Study the fundamental scientific problem of modeling an individual's behavior from heterogeneous sensory time series
     • Data collected from physical and soft sensors
     • Apply the behavioral models to real applications
       • Security: Accountable Mobility Model
       • Mobile Security: SenSec
       • Psychological status estimation: StressSens
  3. • Derived from Behavioral Biometrics: Behaviometrics
     • Behavioral: the way a human subject behaves
     • Biometrics: technologies and methods that measure and analyze biological characteristics of the human body
       • Fingerprints, eye retina, voice patterns
     • BehavioMetrics: measurable behavior used to recognize or to verify
       • The identity of a human subject, or
       • Certain behaviors of the subject
  4. [Diagram labels: Raw Data; Ground Truth; Preprocessing; Modeling; Evaluation; Applications]
  5. [Overview diagram labels: Heterogeneous Sensor Data; Behavioral Text Representation; n-gram; Skipped n-gram; Helix, Helix Tree; DT, RF, SVM…; Accountable Mobility; MobiSens; SenSec; StressSens; Sim. Attacks; Ctrl. Exp.; Auth. Records; Mem. Test; Prec., Recall; Accuracy; Error & FP]
  6. • Human behaviors and activities share some common properties with natural languages
       • Meanings are composed from the meanings of building blocks
       • There exists an underlying structure (grammar)
       • They are expressed as a sequence (time series)
     • Apply the rich set of statistical NLP techniques to mobile sensory data
  7. [Diagram labels: Quantization; Clustering]
  8. • Generative language model (LM): P(English sentence) given a model
       • P("President Obama has signed the Bill of …" | Politics) >> P("President Obama has signed the Bill of …" | Sports)
       • An LM reflects the n-gram distribution of its training data: domain, genre, topics
     • With labeled behavior text data, we can train an LM for each activity type ("walking"-LM, "running"-LM) and classify the activity as: [equation not preserved in the transcript]
  9. • User activity at time t depends only on the last n-1 locations
     • The sequence of activities can be predicted from n consecutive activities in the past
     • Maximum Likelihood Estimation (MLE) from training data by counting: [equation not preserved in the transcript]
     • MLE assigns zero probability to unseen n-grams
       • Incorporate a smoothing function (Katz)
       • Discount probability for observed n-grams
       • Reserve probability for unseen n-grams
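The counting-based estimate and the zero-probability problem described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it uses add-k smoothing as a simpler stand-in for Katz smoothing, and the function names (`train_ngram_counts`, `mle_prob`, `smoothed_prob`) and the toy label sequence are hypothetical.

```python
from collections import Counter

def train_ngram_counts(tokens, n):
    """Count n-grams and the (n-1)-token histories that precede a next token."""
    positions = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in positions)
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in positions)
    return ngrams, histories

def mle_prob(ngram, ngrams, histories):
    """MLE by counting: count(history + token) / count(history)."""
    history_count = histories[ngram[:-1]]
    return ngrams[ngram] / history_count if history_count else 0.0

def smoothed_prob(ngram, ngrams, histories, vocab_size, k=1.0):
    """Add-k smoothing (assumption: a stand-in for Katz, not Katz itself)."""
    return (ngrams[ngram] + k) / (histories[ngram[:-1]] + k * vocab_size)

# Hypothetical behavior-text labels, e.g. quantized accelerometer states.
labels = ["A1", "A2", "A1", "A2", "A1", "A3"]
ngrams, histories = train_ngram_counts(labels, 2)

seen = mle_prob(("A1", "A2"), ngrams, histories)      # bigram observed in training
unseen = mle_prob(("A2", "A3"), ngrams, histories)    # never observed: MLE gives 0
reserved = smoothed_prob(("A2", "A3"), ngrams, histories, vocab_size=3)
print(seen, unseen, reserved)
```

The unseen bigram gets zero probability under MLE but a small positive probability after smoothing, which is the "reserve probability for unseen n-grams" idea in its simplest form.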
  10. • Long-distance dependency of words in sentences
        • Trigrams for "I hit the tennis ball": "I hit the", "hit the tennis", "the tennis ball"
        • "I hit ball" is not captured
      • Future activities depend on activities far in the past; intermediate behavior has little relevance or influence
        • Noise in the data sets: "ping-pong" effects in time series, interference, sampling errors, etc.
        • Model size
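One way to capture "I hit ball" is a skipped n-gram, which allows gaps between the tokens it keeps. The sketch below is an assumption about how such n-grams could be enumerated (the deck does not show its exact definition); `skip_ngrams` and its `max_skip` parameter are hypothetical names.

```python
from itertools import combinations

def skip_ngrams(tokens, n, max_skip):
    """Return all n-grams that skip at most max_skip tokens in total."""
    result = set()
    for start in range(len(tokens)):
        # Candidates for the remaining n-1 positions: the next n-1+max_skip tokens.
        window = tokens[start + 1:start + n + max_skip]
        for picked in combinations(range(len(window)), n - 1):
            result.add((tokens[start],) + tuple(window[i] for i in picked))
    return result

words = "I hit the tennis ball".split()
plain_trigrams = skip_ngrams(words, 3, 0)    # ordinary contiguous trigrams
skipped_trigrams = skip_ngrams(words, 3, 2)  # allow up to two skipped tokens
print(("I", "hit", "ball") in plain_trigrams)
print(("I", "hit", "ball") in skipped_trigrams)
```

The number of skipped n-grams grows quickly with `max_skip`, which is one reason model size appears as a concern on this slide.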
  11. • Build BehavioMetrics models for M classes P0, P1, P2, …, PM-1
        • Genders, age groups, occupations
        • Behaviors, activities, actions
        • Health and mental status
      • For a new behavioral text string L, we calculate the probability that L is generated by model m
      • Classification problem formulated as: [equation not preserved in the transcript]
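The formula itself is not in the transcript. A standard maximum-likelihood decision rule consistent with the bullets on this slide would be the following; it is a reconstruction under assumptions (equal class priors, and an n-gram factorization of each model with tokens l_1, …, l_{|L|}), not necessarily the exact form shown in the talk.

```latex
\hat{m}
  \;=\; \arg\max_{0 \le m \le M-1} P(L \mid P_m)
  \;=\; \arg\max_{0 \le m \le M-1} \sum_{i=1}^{|L|}
        \log P_m\!\left(l_i \mid l_{i-n+1}, \dots, l_{i-1}\right)
```

Working with the sum of log probabilities rather than the product avoids numerical underflow on long behavior text strings.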
  12. • Is this play Shakespeare's work?
      • Compare the play to Shakespeare's known library of works
      • Track word and phrase patterns in the data
      • Calculate the probability of the unknown work U given all of Shakespeare's known works {S}
      • Compare with a threshold θ
        • Authentic work (a=1)
        • Fake, forgery, or plagiarism (a=0)
  13. • A special binary classification problem
      • Given a normal BehavioMetrics model Pn, a new behavior text sequence L, and a threshold θ, calculate the likelihood that L is generated by Pn and compare it with θ
      • If the outcome is -1, flag an anomaly alert
      • Variation caused by noise can be smoothed out statistically
      • Some feedback is needed to handle false positives, which are usually caused by unseen behaviors or a sub-optimal threshold
  14. [Plot: average log probability vs. sliding window position, with a low threshold, a high threshold, and marked regions A, B, C, D]
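The threshold test from slide 13, evaluated over a sliding window as in the plot on slide 14, might look like the sketch below. It assumes per-token log probabilities have already been produced by the normal model Pn; the function names, the window length, and the toy numbers are hypothetical.

```python
def sliding_avg_logprob(token_logprobs, window):
    """Average log probability over each sliding window of the given length."""
    return [
        sum(token_logprobs[i:i + window]) / window
        for i in range(len(token_logprobs) - window + 1)
    ]

def detect_anomalies(token_logprobs, window, theta):
    """Return +1 (normal) or -1 (anomaly) for each window position."""
    averages = sliding_avg_logprob(token_logprobs, window)
    return [1 if avg >= theta else -1 for avg in averages]

# Hypothetical scores: the first half fits the model, the second half does not.
logprobs = [-1.0, -1.0, -1.0, -5.0, -5.0, -5.0]
outcomes = detect_anomalies(logprobs, window=3, theta=-3.0)
print(outcomes)
```

Averaging over a window is what lets short bursts of noise be smoothed out; a lower θ trades fewer false positives for slower or missed detections.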
  15. • Convert feature vector series to label streams (dimension reduction)
      • Step window with assigned length
      [Diagram labels: A1 A2 A1 A4 G2 G5 G2 G2 W2 W1 W2 P1 P3 P6 P1 A2 G2G5 W1 P1P3 A1A4 G2 W1W2 P1]
  16. • Induce the underlying grammar of human activities
        • Identify atomic activities through bracketing and collocation
        • Generalize semantically similar activities into higher-level activities
  17. (1) Vocabulary Initialization using Time-series Motifs; (2) Super-Activity Discovery by Statistical Collocation; (3) Vocabulary Generalization via Aggregated Similarity
  19. ACM MONE Journal, 2012
  21. • Collect the received signal strength (RSS) of the devices at multiple wireless access points (WAPs), with timestamps
      • Aggregate and serialize into a time series of RSS vectors
      * Lin et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
  22. • Dimensionality of the RSS vector: too fine for modeling
      • Proximity in location results in similar RSS vectors
      • K-means clustering with a distance function similar to WASP [1]; each cluster is assigned a pseudo-location label
      [1] Lin et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
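The quantization step, turning RSS vectors into pseudo-location labels, can be sketched with a small k-means. This is an illustration only: plain Euclidean distance stands in for the WASP-like distance function, which the slides do not define, and the function name, the `L0`/`L1` label format, and the toy RSS values (in dBm) are assumptions.

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means with Euclidean distance; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical RSS vectors over three WAPs: two readings near one spot, two near another.
rss_vectors = [
    [-40.0, -70.0, -80.0],
    [-42.0, -72.0, -79.0],
    [-75.0, -45.0, -70.0],
    [-77.0, -43.0, -72.0],
]
cluster_ids, centroids = kmeans(rss_vectors, k=2)
pseudo_locations = [f"L{c}" for c in cluster_ids]  # behavior text tokens
print(pseudo_locations)
```

The resulting label sequence is the "behavior text" that the n-gram models are trained on.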
  23. Dataset
        Users:              40
        Location:           Cisco SJC 14 1F Alpha networks
        RSS sampling rate:  13 sec
        Period:             5 days
        Number of WAPs:     87
        Device:             Cisco Aironet 1500 + MSE
        Dataset size:       3.2 mil points
      • RSS vector clustering
      • Run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids
      • K = 3 × #WAPs has the best trade-off
      • Yields ~260 pseudo locations
  24. • Testing samples
        • Positive sample: simulated anomaly created by splicing traces from two different users
        • Negative sample: trace from the "owner"
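A minimal sketch of how such test samples could be built is shown below. The deck does not say where or how the traces were spliced, so the single splice point, the function name, and the toy pseudo-location traces are all assumptions.

```python
def splice_traces(owner_trace, other_trace, splice_at):
    """Owner's trace up to splice_at, followed by the other user's trace."""
    return owner_trace[:splice_at] + other_trace[splice_at:]

# Hypothetical pseudo-location traces for two users.
owner = ["L1", "L2", "L1", "L3", "L1", "L2"]
other = ["L7", "L8", "L9", "L7", "L8", "L9"]

samples = [
    (splice_traces(owner, other, 3), 1),  # positive: simulated anomaly
    (owner, 0),                           # negative: the owner's own trace
]
print(samples[0][0])
```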
  25. [Plot: ROC curves (true positive rate vs. false positive rate) for data sizes of 8 hours and 12 hours]
  26. [Plot: accuracy vs. n-gram order (0 to 10) for data sizes of 4, 8, and 12 hours]
  27. [Diagram labels: Quantization; Clustering; Sensor Fusion and Segmentation; Activity Recognition; Risk Analysis Tree; Certainty of Risk; Application Sensitivity < Application Access Control; Application Access Control]
  28. [Diagram labels: Sensing; Preprocessing; Modeling; Feature Construction; Behavior Text Generation; N-gram Model; User Classifier; Classification; Binary; Authentication; Threshold; Inference]
      • SenSec collects sensor data
        • Motion sensors
        • GPS and WiFi scanning
        • In-use applications and their traffic patterns
      • The SenSec module builds user behavior models
        • Unsupervised activity segmentation; model the sequence using a language model
        • Build a Risk Analysis Tree (DT) to detect anomalies
        • Combine the above to estimate risk (online): certainty score
      • The Application Access Control Module activates authentication based on the score and a customizable threshold
  29. • Accelerometer
        • Used to summarize the acceleration stream
        • Calculated separately for each dimension [x, y, z, m]
        • Meta features: Total Time, Window Size
      • GPS: location string from Google Map API and mobility path
      • WiFi: SSIDs, RSSIs, and path
      • Applications: bitmap of well-known applications
      • Application traffic pattern: TCP/UDP traffic pattern vectors: [remote host, port, rate]
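The per-dimension accelerometer summary can be illustrated as follows. The transcript does not name the statistics that were used, so mean and standard deviation are assumptions here, as is reading `m` as the magnitude of the (x, y, z) vector; `accel_features` and the sample values are hypothetical.

```python
import math
import statistics

def accel_features(samples, window_size):
    """Summarize (x, y, z) samples per non-overlapping window, per dimension."""
    rows = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        dims = {
            "x": [s[0] for s in window],
            "y": [s[1] for s in window],
            "z": [s[2] for s in window],
            # Assumption: m is the vector magnitude.
            "m": [math.sqrt(x * x + y * y + z * z) for x, y, z in window],
        }
        row = {"window_size": window_size}  # meta feature
        for name, values in dims.items():
            row[f"{name}_mean"] = statistics.fmean(values)
            row[f"{name}_std"] = statistics.pstdev(values)
        rows.append(row)
    return rows

# Hypothetical readings: device at rest, then moving.
readings = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
feature_rows = accel_features(readings, window_size=2)
print(feature_rows[1]["m_mean"])
```

Each feature row would then be quantized into a label, giving the accelerometer's contribution to the behavior text.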
  31. • Offline data collection (for training and testing)
        • Pick up the device from a desk
        • Unlock the device using the right slide pattern
        • Invoke the Email app from the "Home Screen"
        • Lock the device by pressing the "Power" button
        • Put the device back on the desk
  32. • 71.3% true positive rate with a 13.1% false positive rate
  33. • Alpha test in Jun 2012; first Google Play Store release in Oct 2012
      • False positives: the 13% FPR still annoys users at times
      • Use an adaptive model
        • Add the trace data recorded shortly before a false positive to the training data and update the model
      • Change passcode validation to a sliding pattern
      • A false positive grants a "free ride" for a configurable duration
        • Assumption: a just-authenticated user should control the device for a given period of time
      • The "free ride" period ends immediately if an abrupt context change is detected
      • A newer version is scheduled to be released in Jan 2013
  34. • Human stress needs to be properly handled
        • DARPA: Detection and Computational Analysis of Psychological Signals
        • Develop analytical tools to assess the psychological status of warfighters
        • Improve psychological health awareness and enable them to seek timely help
      • Measurement of stress is expensive and time-consuming
        • Expensive medical procedures: EKG, EEG
        • Self-report: questionnaires, interviews, surveys
      • BehavioMetrics-based estimation
        • Monitor mouse movements, screen touches (Windows 8), keystrokes, active applications, and network traffic patterns to build BehavioMetrics
        • Use memory test and other mental exercise results as ground truth
        • Perform classification and regression to build Behavior-Stress models
  36. [Overview diagram labels: Heterogeneous Sensor Data; Behavioral Text Representation; n-gram; Skipped n-gram; Helix, Helix Tree; DT, RF, SVM…; Accountable Mobility; MobiSens; SenSec; StressSens; Sim. Attacks; Ctrl. Exp.; Auth. Records; Mem. Test; Prec., Recall; ROC; Accuracy; FP]
  37. • Language approach in modeling user behavior via textual representation of heterogeneous time series
      • Evaluate and adapt NLP techniques to BehavioMetrics in activity segmentation, recognition, classification, and anomaly detection from sequential data
      • Unsupervised Helix and Helix-TF to discover hierarchical structure in BehavioMetrics for general classification and anomaly detection
      • Build and release 3 applications
        • MobiSens
        • SenSec
        • StressSens
      • Gain insights from experiments and provide guidelines in selecting models, tuning parameters, and improving UX
      • Valuable labeled or partially labeled data sets to enable other BehavioMetric research
      © 2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
  38. • "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications", MONE, 2013 [with P. Wu, J. Zhang]
      • "SenSec: Mobile Application Security through Passive Sensing", to appear in the Proceedings of International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013 [with P. Wu, X. Wang, J. Zhang]
      • "Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks", accepted to ICCCN 2011, Maui, HI, Aug 1-5, 2011 [with Y. Zhang]
      • "Retweet Modeling Using Conditional Random Fields", in the Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011 [with H. Peng, D. Piao, R. Yan, and Y. Zhang]
      • "Mobile Lifelogger - recording, indexing, and understanding a mobile user's life", in the Proceedings of The Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010 [with S. Chennuru, P. Cheng, Y. Zhang]
      • "SensCare: Semi-Automatic Activity Summarization System for Elderly Care", MobiCase 2011, Los Angeles, CA, October 24-27, 2011 [with Pang Wu, Huan-kai Peng, Joy Ying Zhang]
      • "Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition", to appear in the Proceedings of The IEEE International Conference on Data Mining series (ICDM), Vancouver, Canada, Dec 11-14, 2011 [with Huan-Kai Peng, Pang Wu, and Ying Zhang]
      • "Statistically Modeling the Effectiveness of Disaster Information in Social Media", to appear in the Proceedings of IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct. 30 - Nov. 1st, 2011 [with Fei Xiong, Dongzhen Piao, Yun Liu, and Ying Zhang]
      • "A dissipative network model with neighboring activation", to appear in The European Physical Journal B [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, and J. Zhang]
      • "Opinion Formation with the Evolution of Network", to appear in the Proceedings of 2011 Cross-Strait Conference on Information Science and Technology and iCube, TaiBei, China, Dec 8-9, 2011 [with F. Xiong, Y. Liu, Y. Zhang]
  39. Thank you.
