BehavioMetrics: A Big Data Approach



The penetration of mobile devices equipped with various embedded sensors makes it possible to capture the physical and virtual context of the user and the surrounding environment. Modeling human behavior from these data is becoming increasingly important with the growing popularity of context-aware computing and people-centric applications, which use users' behavior patterns to improve existing services or enable new ones. In many natural settings, however, broader applications are hindered by three main challenges: rarity of labels, uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion.

  • Building along this line, we use a continuous n-gram model to learn the sequence of locations from the user's WiFi traces. The n-gram model works under the assumption that the next location in the sequence depends only on the last n-1 locations. Once the n-gram model is trained, we can use it to calculate the probability of every possible next location given the past n-1 locations, and see which one is the most likely. To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probabilities, just by counting: as shown in this equation, the MLE probability of being in a location at time i, conditioned on the past n-1 history locations, is the count of that n-length sequence in the data divided by the count of the (n-1)-length history. There is one small problem with this approach: if the model comes across a location that was never seen in training, it assigns it zero probability, which may push the system to trigger an anomaly alert. Luckily, the n-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz take some probability mass from the seen labels and reserve it for the unseen ones.
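As a sketch of the counting just described: a minimal n-gram trainer in Python, using simple add-k smoothing as a stand-in for the Katz smoothing mentioned above (the data layout and trace labels are illustrative assumptions):

```python
from collections import defaultdict

def train_ngram(locations, n=3):
    """Count n-grams and their (n-1)-gram histories from a location sequence."""
    ngrams = defaultdict(int)
    histories = defaultdict(int)
    for i in range(len(locations) - n + 1):
        gram = tuple(locations[i:i + n])
        ngrams[gram] += 1
        histories[gram[:-1]] += 1
    return ngrams, histories

def prob(ngrams, histories, history, loc, vocab_size, k=1.0):
    """P(loc | history) with add-k smoothing, so unseen n-grams keep some
    probability mass instead of producing a hard zero (Katz is fancier)."""
    history = tuple(history)
    num = ngrams[history + (loc,)] + k
    den = histories[history] + k * vocab_size
    return num / den

# toy WiFi pseudo-location trace (hypothetical labels)
trace = ["lobby", "hall", "office", "hall", "office", "hall", "office"]
ngrams, histories = train_ngram(trace, n=2)
vocab = len(set(trace))
p_office = prob(ngrams, histories, ("hall",), "office", vocab)
p_lobby = prob(ngrams, histories, ("hall",), "lobby", vocab)
```

In this toy trace "office" is the most likely successor of "hall", while the never-seen bigram ("hall", "lobby") still gets a small nonzero probability.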
  • In natural language, words in a sentence may have long-distance dependencies. For example, the sentence "I hit the tennis ball" has three trigrams: "I hit the", "hit the tennis", and "the tennis ball". An equally important trigram, "I hit ball", is not normally captured by the continuous n-gram model, because the separators "the" and "tennis" sit in the middle. If we could skip the separators, we could form this important trigram: "I hit ball". Similarly, in the continuous n-gram model I just described, the user's next location depends only on his n-1 previous locations, but in many cases this may not be true. Using the same example: if a user is leaving the break room and entering the hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway, before he enters the office, are not that important, and can be skipped in the modeling. In the diagram here, ABC is the break room, ACD is the entrance of the hallway and EDB is the office; anything in the middle can be skipped and still give the same result. By skipping the d intermediate grams, the effective n-gram order becomes (n-d). We can therefore reduce the size of the model in terms of computation and storage, because the n-gram model performs better for a lower value of n.
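A minimal sketch of how skip-grams enlarge the set captured by the continuous n-gram model; the fixed window and `max_skip` parameter are assumptions for illustration:

```python
from itertools import combinations

def skip_grams(seq, n=3, max_skip=2):
    """Enumerate n-grams that may skip up to max_skip intermediate elements,
    so 'I hit the tennis ball' also yields ('I', 'hit', 'ball')."""
    grams = set()
    span = n + max_skip
    for i in range(len(seq)):
        window = seq[i:i + span]
        for idx in combinations(range(len(window)), n):
            if idx[0] == 0:  # anchor each gram at the window start
                grams.add(tuple(window[j] for j in idx))
    return grams

sentence = ["I", "hit", "the", "tennis", "ball"]
grams = skip_grams(sentence, n=3, max_skip=2)
```

The result contains both the three continuous trigrams and skip-trigrams such as ("I", "hit", "ball").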
  • Once we have constructed a model of a user's behaviometrics through learning, we can continue monitoring the user's behaviometrics and compare them with the learned model. If the new behaviometrics deviate from the learned model, we may choose to trigger an anomaly alert. However, variations in sensory data streams can also be caused by noise and by new behaviors, not only by anomalous behaviors. Variations caused by noise are less significant and can be smoothed out statistically. To distinguish between anomalous and new behaviors, on the other hand, we need to evaluate whether those unseen patterns can be incorporated into the model over time. Failing to make this distinction may yield temporary false positives, but if feedback mechanisms are in place to correct them, we can still build a robust anomaly detection system for application domains such as theft detection and prevention, casual authentication, emergency detection and healthcare monitoring.
  • To illustrate this process, let's look at an example. The blue curve is the log probability we just described. Say an anomaly happens at point A. If we set the threshold lower, like the red line, the system detects the anomaly at point B with a reasonable delay. But if we set the threshold too high, like the pink line, we mistakenly flag anomalies for a sequence of normal behavior text, which counts toward false positives at points C and D. The way to find the right threshold for different applications is to use the receiver operating characteristic (ROC) curve; we will look at this in more detail later in the talk.
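The sliding-window thresholding described in these two notes might look like the following toy sketch; the bigram-probability model, floor probability and window length are hypothetical:

```python
import math

def avg_log_prob(window, model):
    """Average per-step log probability of a label window under a toy model
    (model: dict mapping a bigram tuple to its probability)."""
    logs = [math.log(model.get(bg, 1e-6)) for bg in zip(window, window[1:])]
    return sum(logs) / len(logs)

def detect(stream, model, threshold, win=4):
    """Slide a window over the label stream and flag positions whose
    average log probability falls below the threshold."""
    alerts = []
    for i in range(len(stream) - win + 1):
        if avg_log_prob(stream[i:i + win], model) < threshold:
            alerts.append(i)
    return alerts

model = {("A", "B"): 0.9, ("B", "A"): 0.9, ("B", "C"): 0.05}
normal = ["A", "B", "A", "B", "A", "B"]
odd    = ["A", "B", "C", "C", "C", "C"]
th = math.log(0.5)
```

With this threshold the normal stream raises no alerts, while the deviating stream is flagged as soon as the window covers the unusual transitions.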
  • Consider a simple example, where the red trace on this office floor represents the usual mobility of a user. Here, the user is finishing a meeting in a conference room and going back to his cubicle. Now look at another path the user might take: instead of going this way, he heads in the other direction, then deviates further and further. In such a case we would want to flag an anomaly. It could be that a visitor who attended the meeting took the device the employee forgot in the conference room and walked away; the device may still have access to the company's internal network and other data sources. On receiving this alert, the infrastructure would revoke the holder's authentication credentials temporarily until the user can authenticate himself again. If, instead of going further away, he is going back to his cubicle, just by an alternate path, we probably do not want to flag an anomaly.
  • The management, control and data frames from a device are heard by multiple APs. In our particular setup, the APs record the received signal strength (RSS) of those frames along with the identity of the device and timing information. These traces are aggregated at a central location, where we serialize them by timestamp and classify them by device ID. So, for a particular device, we can build a time series of RSS vectors, where each element of a vector is the RSS from a particular AP. This series of RSS vectors, along with other context information, serves as the input to the preprocessing module, where we convert it to a text representation before feeding it into our n-gram model.
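The aggregation step can be sketched as follows; the record layout `(timestamp, device, ap, rss)`, the AP inventory and the missing-reading placeholder are all assumptions:

```python
from collections import defaultdict

AP_IDS = ["ap1", "ap2", "ap3"]   # hypothetical AP inventory
MISSING = -100                   # placeholder RSS when an AP heard nothing

def build_rss_series(records):
    """Group (timestamp, device, ap, rss) records by device, then by
    timestamp, and emit a time-ordered list of fixed-length RSS vectors."""
    per_device = defaultdict(lambda: defaultdict(dict))
    for ts, dev, ap, rss in records:
        per_device[dev][ts][ap] = rss
    series = {}
    for dev, by_ts in per_device.items():
        series[dev] = [
            [by_ts[ts].get(ap, MISSING) for ap in AP_IDS]
            for ts in sorted(by_ts)
        ]
    return series

records = [
    (2, "phone-1", "ap1", -40),
    (1, "phone-1", "ap1", -42), (1, "phone-1", "ap2", -70),
]
series = build_rss_series(records)
```

Note the records arrive out of order and are serialized by timestamp, matching the central-aggregation step described above.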
  • From the signal propagation model, if two vectors are very similar, the locations where they were measured should be within reasonable proximity. Based on this assumption, we want to partition the RSS vector space into many "pseudo locations" and assign each pseudo location a unique label. By pseudo, we mean we don't need to know the exact location of a reading; we just need to distinguish between two different locations. This can be done easily with a clustering algorithm, for example k-means. In the k-means clustering runs, we use a distance function similar to Redpin and WASP, in addition to the standard cosine function, to reduce the noise caused by interference. Once the clustering is done, we assign the same label to all members of the same cluster.
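A toy version of the pseudo-location step, using plain cosine distance and a deterministic initialization rather than the WASP/Redpin-style distance function the notes mention:

```python
import math

def cosine_dist(a, b):
    """1 - cosine similarity between two RSS vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def kmeans_labels(vectors, k, iters=10):
    """Toy k-means over RSS vectors; returns one pseudo-location label
    (cluster index) per vector. Initialized from the first k vectors
    for determinism; the real system's distance function is richer."""
    centroids = [list(v) for v in vectors[:k]]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: cosine_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# two obvious "pseudo locations": strong on AP1 vs strong on AP2
vecs = [[-40, -90], [-42, -88], [-90, -41], [-91, -39]]
labels = kmeans_labels(vecs, k=2)
```

The two vectors near AP1 end up sharing one label and the two near AP2 the other, which is all a pseudo location requires.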
  • We collected RSS traces from 87 WAPs in an office building over 5 days, with RSS samples at roughly 13-second precision. The traces contain complete data for 40 users, about 3.2 million data points in total. Backup numbers: pseudo locations derived from RSS (other schemes not very …); about 1,500 RSS data points per user on average, with RSS from 3-7 WAPs per sample; assuming a user is up half of the time, roughly 80k data points per user over 5 days; 3.2 million data points collected for 40 users; about 20 million raw RSS readings; for each of the 40 users, roughly 16K RSS vectors in total.
  • To validate our system, we need testing data. However, the traces we collected fortunately contain no recorded anomalies. We therefore created simulated device-stolen events by splicing two users' trace segments at their intersection points, where a shared label or label sequence occurs. We combined these simulated traces with normal traces to create the testing data set.
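The splicing procedure might be sketched like this (toy traces; the real system splices where label sequences, not just single labels, are shared):

```python
def splice_at_shared_label(trace_a, trace_b):
    """Simulate a device-stolen event: follow user A's trace up to the
    first pseudo location also visited by user B, then continue with
    B's trace from that point on."""
    for i, loc in enumerate(trace_a):
        if loc in trace_b:
            j = trace_b.index(loc)
            return trace_a[:i + 1] + trace_b[j + 1:]
    return None  # no intersection: traces cannot be spliced

a = ["desk-A", "hall", "lobby", "desk-A"]
b = ["desk-B", "lobby", "cafe", "desk-B"]
spliced = splice_at_shared_label(a, b)
```

The spliced trace starts like user A's day and, after the shared "lobby", continues like user B's, producing a positive test sample.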
  • Now that we have gained some insights into our approach, it is time to explore some of the design parameters mentioned in the beginning. The first set of experiments is to find the best anomaly detection threshold. Actually, there is no single best threshold; it depends on the application we are running. What are the requirements on detection accuracy? How many false positives can we tolerate? Do we have enough training data? To provide a guideline for answering these questions, we plot the receiver operating characteristic (ROC) curve. Essentially, the ROC curve captures the trade-off between the true-positive rate and the false-positive rate of our anomaly detection. We run the experiments with different training data sizes, and plot the ROC curve by varying the threshold and recording the TPR and FPR. With the ROC curve, we can choose the threshold for a particular application depending on the amount of data the model should see before it can detect anomalies, the required TPR, or the acceptable FPR. For example, if we want to use an 8-hour training size and keep the false-positive rate below 0.1, we just locate this point and read off the threshold that generated it (0.4); we need a threshold below 0.4 to fulfill the FPR requirement. Another example: with the same FPR requirement but a TPR above 0.8, we have to use more than 8 hours of training data to achieve the goal.
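Sweeping a threshold to produce ROC points can be sketched as follows (the anomaly scores and labels are made-up toy values; higher score means more anomalous):

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, flag samples with score >= threshold and
    return the resulting (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.4, 0.2]   # hypothetical anomaly scores
labels = [1, 1, 0, 0]           # 1 = simulated anomaly, 0 = normal
pts = roc_points(scores, labels, thresholds=[0.1, 0.5, 1.0])
```

Raising the threshold moves the operating point from the top-right of the ROC plane (flag everything) toward the bottom-left (flag nothing), which is exactly the trade-off the notes describe.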
  • We plot these graphs for different training sizes and n-gram orders, and can see several things. A higher-order model captures more context and in turn increases accuracy, but accuracy saturates beyond order 5, which suggests that a user's behavior depends mostly on the last 5 pseudo locations. This resonates with the past work we mentioned in the beginning, and tells us that increasing model complexity beyond this point will not bring significant improvement. Second, the graphs show that a training size as small as 4 hours may not capture a user's mobility behavior thoroughly enough for accurate detection. The closeness of the 8-hour and 12-hour curves also suggests that our system provides relatively good results once it has observed a user's behavior for 8 hours. One interesting point: the 12-hour and 8-hour curves cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation leans toward the larger training set exposing more common locations that the shorter training size does not capture. With these common locations, people share many short sequences, so more simulated anomalies go undetected, bringing down the accuracy.
  • SenSec constantly collects sensory data from the accelerometer, gyroscope, GPS, WiFi, microphone or even the camera. By analyzing the sensory data, it constructs the context in which the mobile device is used, including locations, movements and usage patterns. From this context, the system calculates the certainty that the device is at risk. Each application on the device is assigned a sensitivity value, either manually or automatically. When the user invokes an application, SenSec compares the current certainty with that application's sensitivity level; if the sensitivity exceeds the certainty threshold, an authentication mechanism is employed to enforce the security policy for that application.
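A toy sketch of the sensitivity-vs-certainty gate; the per-app sensitivity values and the exact comparison rule are assumptions, since the notes do not spell out the formula:

```python
SENSITIVITY = {"email": 0.7, "calculator": 0.1}   # hypothetical per-app levels

def requires_auth(app, risk_certainty, sensitivity=SENSITIVITY):
    """Authenticate when the certainty of risk exceeds what the app
    tolerates: a sensitivity of s leaves a risk tolerance of 1 - s.
    Unknown apps default to maximum sensitivity (toy rule)."""
    return risk_certainty > 1.0 - sensitivity.get(app, 1.0)
```

Under this rule, a sensitive email app prompts for re-authentication at moderate risk, while a calculator only does so when the risk certainty is very high.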
  • That brings me to the end of my presentation. Thank you very much for your attention.

    1. Jiang Zhu, jiang.zhu@sv.cmu.edu, December 13th, 2012
    2. Study the fundamental scientific problem of modeling an individual's behavior from heterogeneous sensory time-series
       • Data collected from physical and soft sensors
       • Apply the behavioral models to real applications
         • Security: Accountable Mobility Model
         • Mobile security: SenSec
         • Psychological status estimation: StressSens
    3. Behaviometrics: derived from behavioral biometrics
       • Behavioral: the way a human subject behaves
       • Biometrics: technologies and methods that measure and analyze biological characteristics of the human body
         • Fingerprints, eye retina, voice patterns
       • BehavioMetrics: measurable behavior used to recognize or verify
         • the identity of a human subject, or
         • the subject's certain behaviors
    4. (Pipeline diagram: Raw Data → Preprocessing → Modeling → Evaluation, with Ground Truth feeding Evaluation and Applications consuming each stage)
    5. (Overview diagram: Heterogeneous Sensor Data → Behavioral Text Representation → models (n-gram, skipped n-gram; Helix, Helix Tree; DT, RF, SVM…) → applications (Accountable Mobility, MobiSens, SenSec, StressSens), evaluated with simulated attacks, controlled experiments, authentication records and memory tests via precision, recall, accuracy, error and false-positive metrics)
    6. • Human behavior/activities share some common properties with natural languages
         • Meanings are composed from the meanings of building blocks
         • There exists an underlying structure (grammar)
         • Expressed as a sequence (time-series)
       • Apply rich sets of statistical NLP techniques to mobile sensory data
    7. (Diagram: Quantization → Clustering)
    8. • Generative language model: P(English sentence) given a model
         P("President Obama has signed the Bill of …" | Politics) >> P("President Obama has signed the Bill of …" | Sports)
         The LM reflects the n-gram distribution of the training data: domain, genre, topics
       • With labeled behavior text data, we can train an LM for each activity type ("walking"-LM, "running"-LM) and classify the activity accordingly
    9. • User activity at time t depends only on the last n−1 locations
       • A sequence of activities can be predicted from the n consecutive activities in the past
       • Maximum likelihood estimation from training data by counting
       • MLE assigns zero probability to unseen n-grams; incorporate a smoothing function (Katz) to discount probability for observed grams and reserve probability for unseen grams
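The MLE equation on this slide was rendered as an image in the original; from the speaker notes (count of the n-length sequence divided by the count of the (n−1)-length history) it can be reconstructed as:

```latex
P_{\mathrm{MLE}}\!\left(l_i \mid l_{i-n+1}^{\,i-1}\right)
  \;=\; \frac{C\!\left(l_{i-n+1}^{\,i}\right)}{C\!\left(l_{i-n+1}^{\,i-1}\right)}
```

Here $C(\cdot)$ counts occurrences of a location subsequence in the training data; the notation is an assumption consistent with standard n-gram formulations.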
    10. • Long-distance dependency of words in sentences
          • Trigrams for "I hit the tennis ball": "I hit the", "hit the tennis", "the tennis ball"
          • "I hit ball" not captured
        • Future activities depend on activities far in the past; intermediate behavior has little relevance or influence
          • Noise in the data sets: "ping-pong" effects in time-series, interference, sampling errors, etc.
          • Model size
    11. • Build BehavioMetrics models for M classes P0, P1, P2, …, PM−1
          • Genders, age groups, occupations
          • Behaviors, activities, actions
          • Health and mental status
        • For a new behavioral text string L, calculate the probability that L is generated by model m
        • Classification problem formulated as an arg-max over the M models
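The classification formula on this slide was also an equation image; a standard formulation consistent with the surrounding text (string L, models $P_0,\dots,P_{M-1}$) would be:

```latex
m^{*} \;=\; \operatorname*{arg\,max}_{m \in \{0,\dots,M-1\}} P_m(L)
```

This is a reconstruction, not the slide's verbatim equation.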
    12. • Is this play Shakespeare's work?
        • Compare the play to Shakespeare's known library of works
        • Track word and phrase patterns in the data
        • Calculate the probability of the unknown work U given all the known Shakespeare works {S}
        • Compare with a threshold θ
          • Authentic work (a = 1)
          • Fake, forgery or plagiarism (a = 0)
    13. • A special binary classification problem
        • Given a normal BehavioMetrics model Pn, a new behavior text sequence L, and a threshold θ, calculate the likelihood that L is generated by Pn and compare it with θ
        • If the outcome is −1, flag an anomaly alert
        • Variation caused by noise can be smoothed out statistically
        • Certain feedback is needed to handle false positives, usually caused by unseen behaviors or a sub-optimal threshold
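The decision rule here ("if the outcome is −1, flag an anomaly") was an equation image in the original; a reconstruction consistent with the average-log-probability threshold described in the notes would be:

```latex
a(L) \;=\; \operatorname{sign}\!\left(\frac{1}{|L|}\,\log P_n(L) \;-\; \theta\right),
\qquad a(L) = -1 \;\Rightarrow\; \text{anomaly}
```

The normalization by $|L|$ is an assumption, based on the notes' use of a per-window average log probability.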
    14. (Plot: average log probability vs. sliding window position, showing an anomaly at point A, detection at point B with the low threshold, and false positives at points C and D with the high threshold)
    15. • Convert feature vector series to label streams – dimension reduction
        • Step window with assigned length
          Streams: A1 A2 A1 A4 | G2 G5 G2 G2 | W2 W1 W2 | P1 P3 P6 P1
          Windowed text: A2 G2G5 W1 P1P3  A1A4 G2 W1W2 P1
    16. • Induce the underlying grammar of human activities
          • Identify atomic activities through bracketing and collocation
          • Generalize semantically similar activities into higher-level activities
    17. 1. Vocabulary initialization using time-series motifs
        2. Super-activity discovery by statistical collocation
        3. Vocabulary generalization via aggregated similarity
    19. ACM MONE Journal, 2012
    21. • Collect RSS of the devices at multiple WAPs with timestamps
        • Aggregate and serialize into time series of RSS vectors
        * Lin, et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
    22. • Dimensionality of the RSS vector – too fine for modeling
        • Proximity in location results in similar RSS vectors
        • K-means clustering with a distance function similar to WASP[1]; each cluster is assigned a pseudo-location label
        [1] Lin, et al., "WASP: An enhanced indoor location algorithm for a congested Wi-Fi environment"
    23. Dataset
          Users: 40
          Location: Cisco SJC 14 1F, Alpha networks
          RSS sampling rate: 13 sec
          Period: 5 days
          Number of WAPs: 87
          Device: Cisco Aironet 1500 + MSE
          Dataset size: 3.2 mil points
        RSS vector clustering
          • Run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids
          • K = 3× #WAPs has the best trade-offs
          • Yields ~260 pseudo locations
    24. Testing samples
        • Positive sample: simulated anomaly by splicing traces from two different users
        • Negative sample: trace from the "owner"
    25. (ROC curves – true positive rate vs. false positive rate – for training data sizes of 12 hours and 8 hours)
    26. (Plot: detection accuracy vs. n-gram order, 0–10, for training data sizes of 4, 8 and 12 hours)
    27. (SenSec architecture diagram: Quantization/Clustering, Sensor Fusion, Activity Tree and Segmentation/Recognition feed Risk Analysis; the Certainty of Risk is compared against Application Sensitivity in the Application Access Control module)
    28. (Pipeline: Sensing → Preprocessing (feature construction, behavior text generation) → Modeling (n-gram model, user classifier) → Classification → Binary Authentication (threshold) → Inference)
        • SenSec collects sensor data: motion sensors; GPS and WiFi scanning; in-use applications and their traffic patterns
        • The SenSec module builds user behavior models: unsupervised activity segmentation, with the sequence modeled by a language model; a risk-analysis tree (DT) to detect anomalies; the two combined online to estimate risk as a certainty score
        • The Application Access Control module activates authentication based on the score and a customizable threshold
    29. • Accelerometer
          • Used to summarize the acceleration stream
          • Calculated separately for each dimension [x, y, z, m]
          • Meta features: total time, window size
        • GPS: location string from the Google Maps API and mobility path
        • WiFi: SSIDs, RSSIs and path
        • Applications: bitmap of well-known applications
        • Application traffic pattern: TCP/UDP traffic pattern vectors: [remote host, port, rate]
    31. • Offline data collection (for training and testing)
          1. Pick up the device from a desk
          2. Unlock the device using the right slide pattern
          3. Invoke the Email app from the "Home Screen"
          4. Lock the device by pressing the "Power" button
          5. Put the device back on the desk
    32. • 71.3% true-positive rate with a 13.1% false-positive rate
    33. • Alpha test in Jun 2012; 1st Google Play Store release in Oct 2012
        • False positives: a 13% FPR still annoys users sometimes
        • Use an adaptive model: add the trace data from shortly before a false positive to the training data and update the model
        • Changed passcode validation to a sliding pattern
        • A false positive grants a "free ride" for a configurable duration
          • Assumption: a just-authenticated user should control the device for a given period of time
        • The "free ride" period ends immediately if an abrupt context change is detected
        • A newer version is scheduled for release in Jan 2013
    34. • Human stress needs to be properly handled
          • DARPA – Detection and Computational Analysis of Psychological Signals: develop analytical tools to assess the psychological status of war fighters; improve psychological health awareness and enable them to seek timely help
        • Measurement of stress is expensive and time-consuming
          • Expensive medical procedures: EKG, EEG
          • Self-report: questionnaires, interviews, surveys
        • BehavioMetrics-based estimation
          • Monitor mouse movements, screen touches (Windows 8), keystrokes, active applications and network traffic patterns to build behaviometrics
          • Use memory tests and other mental-exercise results as ground truth
          • Perform classification and regression to build behavior-stress models
    36. (Summary diagram, as on slide 5: Heterogeneous Sensor Data → Behavioral Text Representation → models (n-gram, skipped n-gram; Helix, Helix Tree; DT, RF, SVM…) → applications (Accountable Mobility, MobiSens, SenSec, StressSens), evaluated with simulated attacks, controlled experiments, authentication records and memory tests via precision, recall, ROC, accuracy and FP metrics)
    37. • A language approach to modeling user behavior via textual representation of heterogeneous time-series
        • Evaluated and adapted NLP techniques to BehavioMetrics for activity segmentation, recognition, classification and anomaly detection from sequential data
        • Unsupervised Helix and Helix-TF to discover hierarchical structure in BehavioMetrics for general classification and anomaly detection
        • Built and released 3 applications: MobiSens, SenSec, StressSens
        • Gained insights from experiments and provided guidelines for selecting models, tuning parameters and improving UX
        • Valuable labeled or partially labeled data sets to enable other BehavioMetrics research
        © 2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
    38. • "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications", MONE, 2013 [with P. Wu, J. Zhang]
        • "SenSec: Mobile Application Security through Passive Sensing", to appear in Proceedings of the International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013 [with P. Wu, X. Wang, J. Zhang]
        • "Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks", accepted to ICCCN 2011, Maui, HI, Aug 1-5, 2011 [with Y. Zhang]
        • "Retweet Modeling Using Conditional Random Fields", in Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011 [with H. Peng, D. Piao, R. Yan and Y. Zhang]
        • "Mobile Lifelogger – recording, indexing, and understanding a mobile user's life", in Proceedings of the Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010 [with S. Chennuru, P. Cheng, Y. Zhang]
        • "SensCare: Semi-Automatic Activity Summarization System for Elderly Care", MobiCase 2011, Los Angeles, CA, October 24-27, 2011 [with Pang Wu, Huan-Kai Peng, Joy Ying Zhang]
        • "Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition", to appear in Proceedings of the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, Dec 11-14, 2011 [with Huan-Kai Peng, Pang Wu, and Ying Zhang]
        • "Statistically Modeling the Effectiveness of Disaster Information in Social Media", to appear in Proceedings of the IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct. 30 – Nov. 1, 2011 [with Fei Xiong, Dongzhen Piao, Yun Liu, and Ying Zhang]
        • "A dissipative network model with neighboring activation", to appear in The European Physical Journal B [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, and J. Zhang]
        • "Opinion Formation with the Evolution of Network", to appear in Proceedings of the 2011 Cross-Strait Conference on Information Science and Technology and iCube, Taipei, China, Dec 8-9, 2011 [with F. Xiong, Y. Liu, Y. Zhang]
    39. Thank you.