Exploring Capturable Everyday Memory for Autobiographical Authentication, at Ubicomp 2013


We explore how well the intersection between our own everyday memories and those captured by our smartphones can be used for what we call autobiographical authentication—a challenge-response authentication system that queries users about day-to-day experiences. Through three studies—two on MTurk and one field study—we found that users are good, but make systematic errors at answering autobiographical questions. Using Bayesian modeling to account for these systematic response errors, we derived a formula for computing a confidence rating that the attempting authenticator is the user from a sequence of question-answer responses. We tested our formula against five simulated adversaries based on plausible real-life counterparts. Our simulations indicate that our model of autobiographical authentication generally performs well in assigning high confidence estimates to the user and low confidence estimates to impersonating adversaries.

  • Today, I want to talk about a new form of authentication for smartphones: Autobiographical Authentication. Autobiographical Authentication is rooted in the observation that we can use shared information between user and smartphone as the basis for identifying users. But, you might wonder, why do this at all?
  • Welllll, no. Presently, mobile authentication looks a lot like this: PINs, passwords, or graphical passwords—all forms of authentication that worked well in the desktop and laptop worlds and were ported over from them.
  • But, while smartphones are increasingly used for security-sensitive tasks, as many as 50% of users do not use these forms of authentication on their phones. One problem stifling adoption might be that authentication, today, is generally a non-mobile experience. Perhaps we can do better by using the extra bits of information provided by smartphone sensors. But how?
  • Well, smartphones capture incredibly rich information about our day-to-day lives. Frankly, they know a lot about us. Think about it:
  • We carry them everywhere, and use them in almost all situations.
  • They are our primary means of remote communication with friends, loved ones, work colleagues and even strangers.
  • They are equipped with rich sensors that allow them to know where we are at any given time,
  • and where we’re going.
  • We also use them extensively, and sometimes exclusively, to browse the web.
  • And to take photos of people browsing the web.
  • And the apps. Between ToDo Lists, Games, Maps, and Mail, there’s little about our lives that isn’t encoded somewhere in our phones.
  • So, yeah. Smartphones know a lot about us. In turn, human memory might accurately recall some subset of these rich smartphone logs. This intersection is what I refer to as “capturable everyday memory.”
  • In this project, I wanted to explore to what extent capturable everyday memory can be used as the basis for authentication. How? By asking users a series of questions about their day-to-day experiences, such as:
  • “What did you eat for lunch yesterday?”,
  • “What application did you use at 1pm?”,
  • “Who did you call yesterday at 4pm?”. This is what I call Autobiographical Authentication. But that’s a mouthful, so I’m going to abbreviate it to AutoAuth for this presentation.
  • So, in pursuit of this grand vision, I offer two contributions through the present work.
  • First, I offer a model of capturable everyday memory. I answer two questions: how well can users answer questions about capturable everyday memory, and what factors affect its retention? There is a plethora of past work on everyday autobiographical memory, and I use this to inform the factors I consider in my model (for example, age, gender, and answer uniqueness). But, as far as I know, no work before mine has tried to model the intersection of smartphone logs and everyday memory.
  • Second, I offer a framework for AutoAuth that allows for natural user error. In other words, I answer the question: how do we move from raw, noisy question-answer responses to an authentication decision? Human memory is not perfect. For AutoAuth to work, then, inaccuracies should be expected, and handled gracefully. While prior work has noted the promise of using dynamic autobiographical questions for authentication, no work to my knowledge has actually developed an end-to-end framework for dealing with error and making authentication decisions.
  • So that’s neat, but what did I actually do?
  • Our approach was two-pronged. To get a broad understanding of the questions that could potentially be asked, I ran two open-ended Mechanical Turk studies based on self-report data. Next, to get a more detailed, rigorous handle on the questions that can actually be asked based on present sensors and system logs, I deployed an Android app that indexed ground-truth smartphone logs and generated questions from them. For today’s presentation, I will focus only on the results of the field study. Please refer to the paper for the MTurk results.
  • So let’s talk a little more about the field study itself.
  • The app I built, myAuth, indexed just about all universally accessible information on Android phones, including pertinent sensor readings (such as location sensor readings), app usage information, users’ call and SMS logs, and web browsing and search history. From this information, myAuth was able to ask 13 different questions.
  • These are the 13 questions, along with a corresponding shortened code for easy reference. These questions were selected because they were considered memorable in the MTurk self-report studies.
  • You might notice that the first 8 questions have a dynamically generated time tag. These questions are what I call specific fact-based questions, or just fact-based questions for short. The abbreviations for these questions are prefixed with FB. These questions were generated with a specific answer in mind. For example, “Who did you SMS message on Wednesday, August 21st at 1:34pm?” There is only one correct answer to this question.
  • On the other hand, the last five questions have no such time tag. These questions are what I call “name-any” questions, and have abbreviations prefixed with NA. For these questions, any corresponding event that occurred within the past 24 hours was considered a correct answer. For example, consider the question “Name someone you SMS messaged in the past 24 hours.”
  • Fact-based questions could further be either recall or recognition questions, that is, free-response vs. multiple-choice. For recall questions, users had to enter an open-ended response to answer the question. For recognition questions, users had to select the correct answer out of 10 answers, including some near-miss distractors. Near-miss distractors were answers that were almost right: for example, the name of someone you text messaged at 1:20 instead of 1:34.
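To make the mechanics concrete, here is a minimal sketch of how a recognition question with near-miss distractors might be generated from an SMS log. This is not the paper's implementation: the log format, the `make_recognition_question` helper, and the one-hour near-miss window are all assumptions for illustration.

```python
import random
from datetime import datetime

def make_recognition_question(sms_log, n_choices=10, rng=None):
    """Build a fact-based recognition question from an SMS log.

    sms_log: list of (datetime, contact_name) events, one per outgoing SMS.
    Returns (question_text, correct_answer, shuffled_options).
    """
    rng = rng or random.Random(0)
    when, correct = rng.choice(sms_log)

    # Near-miss distractors: contacts messaged at nearby (but wrong) times.
    near_miss = [name for t, name in sms_log
                 if name != correct and abs((t - when).total_seconds()) <= 3600]
    # Pad with any other contacts from the log.
    others = [name for _, name in sms_log
              if name != correct and name not in near_miss]
    pool = list(dict.fromkeys(near_miss + others))  # dedupe, preserve order
    options = pool[:n_choices - 1] + [correct]
    rng.shuffle(options)

    question = f"Who did you SMS message on {when:%A, %B %d at %I:%M%p}?"
    return question, correct, options

# Demo log: three outgoing SMS events (names and times are made up).
log = [
    (datetime(2013, 8, 21, 13, 34), "Alice"),
    (datetime(2013, 8, 21, 13, 20), "Bob"),   # a near-miss for the 1:34 slot
    (datetime(2013, 8, 20, 9, 5), "Carol"),
]
question, answer, options = make_recognition_question(log)
```

A real deployment would also need to handle contacts with identical names and logs too sparse to fill all ten option slots.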
  • Moving on to the study design: participants had to answer at least 5 questions a day for 14 days, and were given monetary incentives to answer questions correctly. They were also able to skip questions they were uncomfortable answering.
  • I recruited 30 users from CBDR, and 24 of them followed through with the study. Participants were instructed to download myAuth through Google Play. Participants were, on average, 25 years old, and 14 of them were male. Over the course of the 2 weeks, these 24 participants generated 2167 question-answer responses.
  • Now to the interesting stuff: the results! Note that there is a much more detailed description of the analysis and methods used to model capturable everyday memory in the paper. Here, I focus on high-level takeaways that directly inform AutoAuth.
  • First, the question on everyone’s mind: how well could users answer these questions? 1381 out of the 2167 responses, or 64%, were answered correctly, with an additional 168 near misses. As expected, participants performed well but not perfectly. They generally get questions right, but it is hardly uncommon for them to get questions wrong.
  • Moving on, the next question we wanted to answer was how much better recognition is than recall, as recall questions have better security properties. Unsurprisingly, multiple-choice questions were answered significantly more accurately than free-response questions, at 68% versus 62%. Even so, 68% accuracy is hardly stellar. There is still substantial inaccuracy that will need to be dealt with in making authentication decisions.
  • We also wanted to answer whether or not there was a substantial learning effect. In other words, even if users are initially bad at answering these questions, do they get better at it? In fact, performance appeared to be stable over the short term, with a negligible learning effect. Indeed, there was no statistically significant difference between the accuracy of the first 20% and the last 20% of responses by users.
  • Next, we wanted a better handle on how well users could answer different questions. (click) It seems that fact-based questions about incoming calls and outgoing SMS messages were answered most accurately, at just over 80%, (click) while questions about what users searched for on the internet or what application they were using at a specific time were answered correctly at much lower rates: just 28% for application questions, and 18% for web search questions. (click) In general, it seems that users are good at answering questions about communication, (click) but bad at answering questions about phone and web usage.
  • Finally, we wanted to know how much time-bucketing helps, as fact-based questions have better security properties. But, surprisingly, time-bucketing did not help. There was no statistically significant difference between how often users got fact-based and name-any questions correct. This is good news from a security perspective.
  • So how can we turn all of what we just learned into an authentication system?
  • Well, we know the straightforward approach would not work. AutoAuth is not like passwords, and we cannot expect that a user will answer every question presented correctly. On the other hand, users’ performance appears to be stable over the short term, and it seems that users make systematic, predictable errors.
  • And there is the key insight. Given the systematic nature of user errors, incorrect answers can also be a signal. In other words, we can use both correct and incorrect answers as indicators that the attempting authenticator is, in fact, the user. A simple example would be that if I generally get questions about location correct, but questions about outgoing SMS messages incorrect, AutoAuth should look for that pattern in my responses.
  • So, the operationalization of that insight is this mess that we call the confidence estimator: the cornerstone of AutoAuth. Note that my goal here is simply to provide you with high-level intuition. The details of how we arrived at this equation are outlined in the paper; here I’ll just summarize each of these terms. (click)(click) Our estimator takes in three parameters: a user model derived from the training data, u; the observed sequence of question-answer responses from an authentication session, seq; and a notion of an ‘adversary’, u hat. The adversary model is a computational representation of an impersonator that we believe might be trying to crack the system. (click) Given these three parameters, our estimator outputs a confidence estimate from 0 to the prior probability that the attempting authenticator is the user. We compute this estimate as the product of two terms: one that encapsulates the relative likelihood that what we observe comes from the user and not the adversary, and a second that encapsulates how well what we observe matches what we expect from the user’s prior performance. (click) More specifically, the first term is a Bayesian model that encapsulates the likelihood that seq comes from the user and not the adversary. We need an adversary model for this term because in order to compute the likelihood that an observed sequence of responses comes from the user, we also need the likelihood that it does *not* come from the user. Our adversary model is simply a simplification of all possible definitions of “not user”. (click) Finally, the second term encapsulates how well what we observe matches what we expect, given that the attempting authenticator is the user. For example, if a user generally gets questions about location correct, we expect the user to continue to get these questions correct in the observed sequence. If that is what we observe, then this value will be high. If what we expect is not what we observe, this value will be low.
  • To summarize, AutoAuth takes in a sequence of autobiographical question-answer responses and an adversary model, and outputs a confidence score from 0 to the prior probability that the authenticator is the user.
  • That all seems great in theory, but how well did the model actually perform?
  • Specifically, I wanted to answer the question: if our field study users were actually using AutoAuth, what confidence scores would they be getting? What confidence scores would adversaries get? The first step towards answering these questions was to pick the adversaries to simulate, both weak and strong.
  • In total, I simulated 5 adversaries based on plausible real-life counterparts. For this presentation, I will only talk about 3: 1 simple and 2 advanced. Please refer to the paper for the others.
  • The Naïve adversary is given a 1/10 chance of guessing any question correctly, simulating a random guess on a recognition question. This adversary might represent a complete stranger who steals the user’s phone.
  • The Always Correct adversary, or ACA, always answers every question correctly. While the concept is simple, this adversary is quite advanced, potentially representing malware that has compromised the knowledge base but not the authenticator.
  • The Empirical Knows Correct Adversary, or EKCA, is the most advanced adversary I consider. This adversary both knows the correct answer to every question and is smart about when it answers correctly. The EKCA has built, from a separate set of users, an empirical probability distribution over how often any user gets each type of question correct, and uses this empirical distribution to selectively answer correctly or incorrectly to best emulate the empirical average.
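These three adversaries are simple enough to sketch directly. Below is a hypothetical encoding of them as Python classes; the class and method names are mine, and the EKCA's per-question-type accuracies would in practice come from a separate pool of users, as described above.

```python
import random

class NaiveAdversary:
    """A stranger guessing at random: a 1-in-10 chance of getting a
    10-option recognition question right."""
    def answers_correctly(self, qtype, rng):
        return rng.random() < 0.1

class AlwaysCorrectAdversary:
    """ACA: has the knowledge base, so it answers every question correctly."""
    def answers_correctly(self, qtype, rng):
        return True

class EKCAdversary:
    """EKCA: also knows every answer, but deliberately answers each question
    type correctly only as often as the empirical average user does."""
    def __init__(self, empirical_accuracy):
        # empirical_accuracy: {question_type: P(average user answers correctly)}
        self.empirical_accuracy = empirical_accuracy
    def answers_correctly(self, qtype, rng):
        return rng.random() < self.empirical_accuracy[qtype]
```

For example, `EKCAdversary({"FBInCall": 0.8, "FBIntSrc": 0.18})` would answer incoming-call questions correctly about 80% of the time and web-search questions about 18% of the time, mirroring the field-study accuracies reported earlier.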
  • So, how well does AutoAuth do? Specifically, what sort of confidence scores do users get when our confidence estimator adopts the aforementioned adversaries?
  • So, we have this graph. The x-axis represents the number of questions answered in one single session, ranging from 1 to 13. The y-axis represents the authentication confidence output from our confidence estimator, ranging from 0 to 95, where 95 is the prior probability that the attempting authenticator is the user. This graph plots the mean confidence estimate our field study users get from AutoAuth as they answer an increasing number of questions. The translucent shading around the plot is the 95% confidence interval. There are two things to pay attention to here. One is the direction of the trend. We want the plot to move up and to the right. After all, if real users get decreasing confidence estimates, our estimator would not be doing a very good job! The second is the actual confidence values. We can see that when actual users attempt to authenticate and the adversary model is naïve, users quickly get very high confidence scores. Note that 80 is a fairly high confidence score, because the maximum value would imply that the data we observe has a 0% chance of coming from the adversary and a 100% chance of matching our expectations of the user. This is very unlikely in practice!
  • Here, I’ve superimposed the confidence scores that real users would obtain if modeled against an Always Correct Adversary. This adversary is simple, but represents a very salient threat: a stalker who carefully chronicles a user, or a malicious program that has direct access to the knowledge base. The results are quite promising. We can see that while the initial confidence is markedly lower than when modeled against the naïve adversary, users quickly obtain high confidence estimates even against this powerful adversary.
  • Finally, I’ve superimposed real users’ confidence estimates when modeled against the most powerful adversary: the EKCA. Not surprisingly, the confidence estimates offered by AutoAuth are much more conservative when modeling against this adversary. It is, after all, an incredibly sophisticated adversary. Nevertheless, the trend is still upwards, suggesting that users can still achieve high confidence scores, albeit with a greater time investment. In general, it seems that AutoAuth always yields a reasonably high confidence score when actual users are trying to authenticate. These results are encouraging when you remember that the ACA and EKCA are fairly sophisticated adversaries that already have access to incredibly sensitive data about the user.
  • Next, we wanted to see how well adversaries could impersonate users, so we simulated the confidence scores our three adversaries would obtain when modeled against each other.
  • The next question is how well adversaries can impersonate users. This plot shows how the user and our three adversaries perform after answering five and eight questions in one session, when modeled against a naïve adversary. The actual user’s performance is also shown as a comparison point. We should see that the confidence score for the adversary modeled against is low, while the user’s confidence remains high. The bars for adversaries not modeled against may be low or high: ideally low, but it would not be surprising if they were high. We see exactly what we expect. The user performs very well, while the naïve adversary itself performs very poorly. However, advanced adversaries like the ACA and EKCA also obtain high confidence scores when AutoAuth adopts the naïve adversary model.
  • But there is hope! Here we see the same graph, except with scores derived when modeled against the Always Correct Adversary. This graph is very encouraging, as we can see the ACA’s confidence score drops dramatically: from near 80 to less than 20. In other words, when specifically modeled against, the Always Correct Adversary acquires very low confidence estimates from AutoAuth. Users themselves, however, continue to acquire high scores, suggesting a negligible usability hit when adopting this stronger model.
  • Lastly, we see the same graph with scores modeled against the EKCA. AutoAuth generally produces much more conservative estimates across the board when it adopts this powerful adversary model, as we would expect. Nevertheless, the user still performs well relative to impersonators, while the EKCA performs less well and obtains lower scores as it answers more questions. This is especially encouraging, because it suggests that the EKCA cannot simply answer more questions to get a higher score.
  • In sum, AutoAuth performs reasonably well. Users always get relatively high and increasing confidence estimates, no matter the adversary modeled against. Furthermore, AutoAuth can easily defend against simple adversaries like the naïve adversary, which is great when you consider that these are by far the most common adversaries. On the other hand, advanced adversaries can fool AutoAuth when it adopts a simple adversary model, but even these adversaries encounter roadblocks when specifically modeled against. These are all promising results, suggesting that AutoAuth might actually be useful in practice!
  • So what have we learned from all of this?
  • One of the key points I hope you take away is that while people are not flawless at answering questions about capturable everyday memory, even their inaccuracies can be used as useful signals for authentication! Indeed, Autobiographical Authentication does just that and shows a lot of promise. It accounts for systematic response error and performs well even against some incredibly sophisticated adversaries.
  • But as with any study, there are limitations that offer fruitful avenues for future work. For one thing, Autobiographical Authentication is slow: it takes 22 seconds on average for users to answer each question. Additionally, AutoAuth requires constant device usage to replenish its knowledge base. Most of its nice properties only arise in environments where the same question never has to be asked twice. Finally, it remains unclear how users will react to this sort of authentication. Edge cases, such as when a user genuinely but uncharacteristically answers all questions correctly, might break usability unless they can be handled gracefully.
  • With that, I’ll open the floor to questions.
  • Finally, a word on the practical use cases for AutoAuth as it stands. As AutoAuth is presently slow, one can imagine that while it is not a replacement for passwords, AutoAuth might be useful for password recovery. Presently, challenge questions that query static facts are used as password-recovery challenges; AutoAuth might be a more robust and secure replacement. AutoAuth is also amenable to Context-Aware Scalable Authentication, which scales authentication to be more challenging or simpler based on the riskiness of the context. One can imagine scaling the minimum threshold confidence required for authentication with AutoAuth. Also, as AutoAuth is dynamic, many threats such as brute-force attacks and shoulder surfing become less potent. Finally, AutoAuth can also conceivably be used for tiered authentication, where not all data is treated equally. Read access to text messages might require low confidence, while write access to security settings might require high confidence.
  • So you might be wondering at this point: what does AutoAuth offer? Here are two of my favorites. First, AutoAuth is pretty neat in that it can be scaled by context. In other words, if you’re in a risky context (for example, in an unknown city for the first time), AutoAuth might ask you to answer 5 or 6 questions. But if you’re at home, AutoAuth might ask you to answer just 1 question. Second, AutoAuth allows authentication to be dynamic and ever-changing. Thus, attacks built with static challenge-response in mind can be much harder to execute. For example, even if someone sees me answering a question once, that doesn’t necessarily give them insight into the question they would have to answer if they stole my phone.
  • This graph plots participants’ mean confidence score, along with the 95% standard errors, after answering a certain number of questions within a single session. The different lines represent the user’s confidence scores computed when modeled against different adversaries. The first encouraging result is that no matter the adversary, the user’s confidence score appears to increase as they answer more questions. Unsurprisingly, AutoAuth offers the highest confidence estimates when modeled against the weakest adversary: the naïve adversary. In this case, after just 5 questions, the user achieves a confidence score close to 80. AutoAuth offers the most conservative estimates when it adopts the strongest adversary model: the EKCA. Nevertheless, the trend is still upwards, suggesting that users can answer more questions to achieve higher confidence. In general, it seems that AutoAuth always yields a reasonably high confidence score when actual users are trying to authenticate.
  • So that’s a cool idea, but how do we operationalize it? Well, what we have is a user model with training data, u. Let’s say we have an observed sequence of question-answer responses, seq, that represents an authentication session, and we’re trying to make an authentication decision. Given this information, we want to compute something like P(u | seq)—the probability that the attempting authenticator is the user, given the observed sequence of responses.
  • Bayes’ Theorem tells us a lot about how to get what we want from what we have. (click) The first term in the numerator is P(seq | u), the probability that we would observe the given sequence of responses from the user. This is easy to calculate from training data. Imagine I’m trying to log in to my phone, and it asks me: what did you eat for lunch yesterday? My phone knows, from the training data, that I answer this question correctly 70% of the time, so P(seq | u) is just that: 0.7. (click) The second term in the numerator, P(u), is simply the prior probability that the authenticator is the user. For personal mobile devices, this is just a high constant: something close to, but not quite equal to, 1. (click) The denominator, P(seq), represents the overall probability that we would observe this sequence of responses.
  • P(seq) is hard to compute because it requires knowledge of all possible impersonators. But we can break P(seq) down into two components that make it more manageable: the probability that the sequence comes from the user, plus the probability that the sequence doesn’t come from the user, or P(seq | u hat). (click) u hat is what I call the adversary model. Rather than try to enumerate all possible non-users, we model a specific kind of non-user and use that model as a representation of who we consider our most likely impersonator.
  • The modified equation takes in an additional parameter, u hat, and breaks the denominator down into the two components I just mentioned.
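Spelled out, the modified equation reads P(u | seq, u hat) = P(seq | u)P(u) / (P(seq | u)P(u) + P(seq | u hat)P(u hat)). Here is a minimal numeric sketch, under the assumption that the adversary soaks up all remaining prior mass, i.e. P(u hat) = 1 − P(u); the 0.95 prior and the two likelihoods are illustrative values, not figures from the paper.

```python
def posterior_user(p_seq_given_user, p_seq_given_adv, prior_user=0.95):
    """Bayes with two hypotheses: the authenticator is the user, or it is
    the modeled adversary (assumed prior: 1 - prior_user)."""
    numerator = p_seq_given_user * prior_user
    denominator = numerator + p_seq_given_adv * (1.0 - prior_user)
    return numerator / denominator

# A single question answered correctly: the user historically gets it right
# 70% of the time, while a naive adversary guesses right 10% of the time.
confidence = posterior_user(0.7, 0.1)  # very high confidence in the user
```

Note how the high prior does much of the work here: even a modest likelihood ratio pushes the posterior close to 1 for a single question.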
  • But, you might have noticed that this simplification has introduced a problem. (click) To get a high score, an authenticator need only minimize the second term of the denominator: P(seq | u hat).
  • In other words, a clever impersonator can get a high score simply by being unlike the adversary, even if he is nothing like the user.
  • We can fix this by adding an additional term: the bit-string similarity between the expected and actual correctness of the observed sequence of responses. I won’t spend too much time explaining this term, but the important takeaway is that it encapsulates how well the observed sequence of responses matches our expectations if it were actually the user, on a scale of 0 to 1. Please refer to the paper for more details.
  • So, we finally arrive at the final equation. Our confidence in the user is the probability that the observed sequence comes from the user, given an adversary model, times the similarity between the expected and observed sequences of responses.
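A toy end-to-end sketch of that final equation follows. This is my reading of the structure described above, not the paper's exact formulation: the similarity term is simplified to the fraction of responses whose correctness matches the user's historical tendency, and scaling the likelihood-ratio term by the prior (so that the output tops out at the prior, matching the 0-to-prior range mentioned earlier) is an assumption.

```python
def seq_likelihood(responses, accuracy):
    """P(seq | model): product over responses of the probability of the
    observed correctness. responses: list of (qtype, answered_correctly);
    accuracy: {qtype: P(answers correctly)}."""
    p = 1.0
    for qtype, correct in responses:
        p *= accuracy[qtype] if correct else 1.0 - accuracy[qtype]
    return p

def bit_similarity(responses, accuracy):
    """Simplified stand-in for the paper's bit-string similarity: the
    fraction of responses whose correctness matches what we expect of the
    user (expected correct iff historical accuracy >= 0.5)."""
    matches = sum((accuracy[q] >= 0.5) == c for q, c in responses)
    return matches / len(responses)

def confidence(responses, user_acc, adv_acc, prior_user=0.95):
    """Confidence that the authenticator is the user, in [0, prior_user]."""
    p_user = seq_likelihood(responses, user_acc)
    p_adv = seq_likelihood(responses, adv_acc)
    total = p_user + p_adv
    likelihood_term = p_user / total if total > 0 else 0.0
    return prior_user * likelihood_term * bit_similarity(responses, user_acc)

# A user who is good at incoming-call questions but bad at web-search
# questions answers "in character", scored against a naive (10%) adversary:
user_acc = {"FBInCall": 0.80, "FBIntSrc": 0.18}
adv_acc = {"FBInCall": 0.10, "FBIntSrc": 0.10}
session = [("FBInCall", True), ("FBIntSrc", False)]
score = confidence(session, user_acc, adv_acc)  # high: matches the pattern
```

The similarity factor is what makes an incorrect answer informative: answering the web-search question wrong, in character, raises the score rather than lowering it.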
  • To get a more detailed look at what affected a participant’s ability to answer a question correctly, I modeled the correctness of a response with a mixed-effects logistic regression with the user as a random effect. In short, I used a mixed-effects model because I collected multiple responses per user, and because it allows each user to have his or her own baseline likelihood of getting a question correct. For more on the modeling technique, please refer to the paper.
  • These are the fixed-effect controls I included in the model. As you can see, I controlled for demographic variables like age and gender that might affect memory retention, and included several other features in the model, including question type and answer uniqueness. However, I’ll primarily be focusing on answer uniqueness.
  • Another curious finding is that questions with more unique answers were answered incorrectly more often than questions with less unique answers. This seems unintuitive; surely, more unique events are more memorable! However, this finding suggests that users may not actually “remember” answers, but simply “know” them. In other words, if asked “Where were you on Thursday at 1pm?”, I might say Newell Simon Hall because that’s where I generally am on Thursdays at 1pm, whether or not I specifically remember being there at that time. This heuristic works well for the majority case, but breaks down when we stray from routine.
  • So that’s a cool idea. How do we actually get to it? Imagine a simplified system with only two question types. We have some training data from a user, from which we know how often the user gets both types of questions correct, in the left table. We also have an attempted authentication into this user’s smartphone, in the right table. How can we get from the raw question-answer responses to a confidence estimate?
  • The first thing we can do is calculate the probability that we would observe this question-answer response sequence from the user. We do that by simply plugging in and multiplying. The attempting authenticator got the first question, of type QT1, correct, and the second question, of type QT2, incorrect. Our training data tells us that there is a 42% chance we would observe this sequence from the actual user.
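The arithmetic behind that 42% might look like the following. The training-table values (the user answers QT1 correctly 70% of the time and QT2 correctly 40% of the time) are my illustration, chosen only to reproduce the figure in the talk.

```python
# Training data (illustrative): per-question-type accuracy for this user.
p_correct = {"QT1": 0.7, "QT2": 0.4}

# Observed session: QT1 answered correctly, QT2 answered incorrectly.
session = [("QT1", True), ("QT2", False)]

# P(seq | u): multiply P(correct) for correct answers and
# P(incorrect) = 1 - P(correct) for incorrect ones.
p_seq_given_user = 1.0
for qtype, correct in session:
    p_seq_given_user *= p_correct[qtype] if correct else 1.0 - p_correct[qtype]
# p_seq_given_user = 0.7 * 0.6 = 0.42
```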
  • Now, P(seq | u hat) can potentially be problematic as well, but we can simplify the problem by adopting a specific adversary model. In other words, u hat is not all possible non-users, but a model of the non-user we believe is our “adversary”: the most likely impersonator.
  • For the location question, “Where were you at <time>?”, users were asked to select their location on a map. Any location selected within the error radius of the sensor reading at the time was considered correct.
    1. 1. Exploring Capturable Everyday Memory for Autobiographical Authentication Sauvik Das, Eiji Hayashi, Jason Hong 1 {sauvikda, ehayashi, jasonh}@cs.cmu.edu
    2. 2. Mobile Authentication Today 2
    3. 3. Mobile Authentication Is… 3 • Underutilized – Up to 50% do not use authentication. • Outdated – Ported from the desktop world. http://confidenttechnologies.com/files/Infographic%20small%20image.png
    4. 4. 4 Smartphones know a lot about us.
    5. 5. 5 We carry them everywhere.
    6. 6. 6 They are our communication hubs.
    7. 7. 7 They know where we are.
    8. 8. 8 They know where we’re going.
    9. 9. 9 We use them to browse the web.
    10. 10. 10 …and to take photos.
    11. 11. 11 And the apps! Oh, the apps we use!
    12. 12. Key Observation 12 Smartphone Logs Human Memory Capturable Everyday Memory
    13. 13. Key Observation • Capturable Everyday Memory can be used as the basis for a series of autobiographical challenge-response questions. • This is what we call Autobiographical Authentication. 13
    14. 14. 14 What did you eat for lunch yesterday?
    15. 15. 15 What application did you use at 1pm?
    16. 16. 16 Who did you call yesterday at 4pm?
    17. 17. CONTRIBUTIONS 17
    18. 18. A Model of Capturable Everyday Memory • How well can users answer questions about capturable everyday memory? • What factors affect their performance? 18
    19. 19. A Framework for Autobiographical Authentication • How can we move from raw, noisy question- answer responses to an authentication decision? • How do we handle inaccuracies/lapses in human memory? 19
    20. 20. METHODOLOGY 20
    21. 21. Approach • Ran 3 studies to get a handle on both questions that could be asked and can be asked. • 2 Mturk studies based on self-report data. • 1 Field study with ground truth data. 21
    22. 22. FIELD STUDY 22
    23. 23. myAuth • Indexed “knowledge” on the phone: – Sensor readings – System-level content-providers • Capable of asking 13 questions. 23
    24. 24. 24 QType Description FBApp What application did you use on <time>? FBLoc Where were you on <time>? FBOCall Who did you call on <time>? FBInCall Who called you on <time>? FBOSMS Who did you SMS message on <time>? FBInSMS Who SMS messaged you on <time>? FBIntSrc What did you search the internet for on <time>? FBIntVis What website did you visit on <time>? NAOSMS Name someone you SMS messaged in the last 24 hours. NAInSMS Name someone who SMS messaged you in the last 24 hours. NAOCall Name someone you called in the last 24 hours. NAInCall Name someone who called you in the last 24 hours. NAApp Name an application you used in the last 24 hours.
27. Recognition vs. Recall
28. Study Design • Answer 5 questions a day for 14 days. • Incentive to answer questions correctly. • Could skip questions if they chose.
29. Descriptive Stats • 24 users – average age 25 (s.d. 6.25, range 18-43) – 14 male (58%) • 2167 question-answer responses collected.
30. FIELD STUDY RESULTS
31. 1381 questions answered correctly (64%) + 168 near misses (8%)
32. Recognition > Recall: 68% of recognition questions answered correctly vs. 62% of recall questions (p = 0.008).
33. Performance Stable Over Time: no difference between the first and last 20% of responses (64.1% vs. 64.8%, p = 0.73).
34. Question Type Matters
35. Time Bucketing Does Not Help: no difference between “Fact-Based” and “Name-Any” questions (64% vs. 63%, p = 0.5).
37. Users only get 64% of answers correct. But performance is stable and errors are systematic.
38. Given the systematic and stable nature of user errors, we can make an authentication decision using both correct and incorrect answers.
39. Confidence Estimator
C(u | seq, û) = P(u | seq, û) * S(seq | u)
• C(u | seq, û): confidence that the attempting authenticator is the user; range 0 to P(u).
• P(u | seq, û): probability that the observed sequence comes from the user, given an adversary model.
• S(seq | u): value from 0 to 1 indicating how well the observed sequence of responses matches what we expect from the user.
• u: user model; seq: observed question-answer responses; û: adversary model.
40. AutoAuth takes in a sequence of autobiographical question-answer responses and an adversary model, and outputs a confidence score from 0 to P(u).
41. EVALUATION
42. Evaluation Plan • Use field study data to run simulations: what confidence scores would users get versus impersonators? • Simulate 5 plausible adversaries, from weak to strong.
43. 5 Adversaries Simulated, from weakest to strongest: Naïve Adversary, Observing Adversary, Always Correct Adversary, Empirical Observing Adversary, Empirical Knows Correct Adversary.
44. Naïve Adversary • Simulates a complete stranger who steals the phone. • Has a 1/10 chance of guessing the correct answer; guesses randomly on recognition questions.
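As a quick illustration of how weak this adversary is relative to users' 64% accuracy, the guessing behavior can be simulated in a few lines. This is a sketch: the half-and-half recall/recognition mix and the four-option recognition format are assumptions for illustration, not parameters from the talk.

```python
import random

def simulate_naive_adversary(n_questions=1000, n_options=4, seed=42):
    """Estimate the fraction of questions a naive adversary answers correctly.

    Recall questions: assumed 1/10 chance of guessing the free-form answer.
    Recognition questions: uniform random pick among n_options choices.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        if rng.random() < 0.5:  # assumed: half the questions are recall
            correct += rng.random() < 0.1
        else:                   # the rest are recognition (multiple choice)
            correct += rng.randrange(n_options) == 0
    return correct / n_questions

rate = simulate_naive_adversary()
```

Under these assumptions the naive adversary lands far below the 64% a legitimate user achieves, which is why it is the weakest adversary in the lineup.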
45. Always Correct Adversary • Simulates a stalker, or software that has compromised the knowledge base. • Always answers correctly.
46. Empirical Knows Correct Adversary • Simulates an adversary with all correct answers and an understanding of how an average user answers questions. • Purposely gets certain questions wrong to best simulate the “average” user.
47. Evaluation Results: User Performance
51. Evaluation Results: Adversary Performance
52. Against the Naïve Adversary
53. Against the Always Correct Adversary (ACA)
54. Against the Empirical Knows Correct Adversary (EKCA)
55. Evaluation Take-Aways • Users consistently receive relatively high confidence scores. • We can easily defend against simple adversaries. • Advanced adversaries do better, but can still be detected when they are explicitly modeled.
56. CONCLUSION
57. Summary • People answered 64% of autobiographical questions correctly, on average. • Their errors can be signals, too! • Autobiographical Authentication is promising and robust against some tough adversaries.
58. Limitations • Autobiographical Authentication is slow (22 seconds per question, on average). • Requires constant device usage to replenish the knowledge base. • Remains unclear how users will react to this sort of authentication in practice.
59. Questions?
60. Practical Use Cases • Password Reset. • Scalable, Dynamic Authentication. • Tiered Authentication.
61. Why might we want this? • Scalable by context. • Authentication is dynamic: shoulder-surfing, social engineering, and brute-force attacks are all much harder to execute.
62. Usability Sanity Check
64. Systematic Response Error Model: given an observed sequence of responses, seq, and a user, u, we want something like P(u | seq).
65. Using Bayes' Rule:
P(u | seq) = P(seq | u) P(u) / P(seq)
• P(u): prior probability that it is the user.
• P(seq | u): probability that we would observe this sequence of responses from the user.
• P(seq): overall probability that we would observe this sequence of responses.
66. Adopting an Adversary: P(seq) is hard to compute perfectly, since it requires knowledge of all possible impersonators. But we can break it into two components:
P(seq) = P(seq | u) + P(seq | û), where û is the adversary model.
67. Modified Bayesian Equation:
P(u | seq, û) = P(seq | u) P(u) / (P(seq | u) + P(seq | û))
68. Problem: to get a high score, an impersonator only needs to minimize the P(seq | û) term in the denominator.
69. “Unlike the Adversary Model” Attack: a clever impersonator can get a high score simply by being unlike the adversary model, even if s/he is not like the user.
70. Add Bit-String Similarity:
S(seq | u) = (n - |E(seq | u) - correct(seq)|) / n
• n: number of responses.
• E(seq | u): expected answer correctness (bit string) from the user.
• correct(seq): actual answer correctness (bit string) from the authenticator.
• S(seq | u): the resulting bit-string similarity.
71. Final Equation:
C(u | seq, û) = P(u | seq, û) * S(seq | u)
• C(u | seq, û): confidence that the attempting authenticator is the user; range 0 to P(u).
• P(u | seq, û): probability that the observed sequence comes from the user, given an adversary model.
• S(seq | u): bit-string similarity between the correctness of the observed sequence and the expected correctness of the observed sequence.
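Putting the pieces together, the whole pipeline can be sketched in a few functions. This is a minimal illustration, not the paper's implementation: the per-question accuracies, the prior, the adversary's guess rate, and the thresholding of E(seq | u) into a bit string are all assumptions made here.

```python
def seq_likelihood(p_correct, observed):
    """P(seq | model): product of P(correct) or P(incorrect) per question."""
    prob = 1.0
    for p, was_correct in zip(p_correct, observed):
        prob *= p if was_correct else (1.0 - p)
    return prob

def similarity(expected, observed):
    """S(seq | u) = (n - Hamming(E(seq | u), correct(seq))) / n."""
    n = len(observed)
    hamming = sum(e != o for e, o in zip(expected, observed))
    return (n - hamming) / n

def confidence(p_user, p_adv, observed, prior_u=0.5):
    """C(u | seq, û) = P(u | seq, û) * S(seq | u); range 0 to prior_u."""
    lik_u = seq_likelihood(p_user, observed)
    lik_a = seq_likelihood(p_adv, observed)
    posterior = lik_u * prior_u / (lik_u + lik_a)
    # E(seq | u): expected correctness bit string, here thresholded at 0.5.
    expected = [p > 0.5 for p in p_user]
    return posterior * similarity(expected, observed)

# Illustrative user model: usually right on the first two question types,
# usually wrong on the last two. The adversary guesses correctly 10% of the time.
p_user = [0.7, 0.7, 0.4, 0.4]
p_adv = [0.1] * 4

user_score = confidence(p_user, p_adv, [True, True, False, False])
stranger_score = confidence(p_user, p_adv, [False, False, False, False])
```

A responder who errs the way the user does scores near the ceiling of P(u), while one who misses everything scores near zero, which is the behavior the evaluation slides report.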
72. Modeling Capturable Everyday Memory • Used a mixed-effects logistic regression. • Users as a random effect: each user had his/her own baseline likelihood of getting a question correct (intercept).
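In standard notation, a random-intercept logistic model of this form can be written as follows (a sketch of the model class, not the paper's exact specification):

```latex
\operatorname{logit}\Pr(\mathrm{correct}_{ij} = 1)
  = \beta_0 + u_i + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here u_i is user i's random intercept (his or her baseline likelihood of answering correctly) and x_ij collects the fixed effects for question j, listed in the table that follows.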
73. Fixed effects:
Age: Integer age.
Gender: Male or Female.
Time to Answer: Number of seconds it took the user to answer the question.
Time since Correct Answer: Number of hours since the correct answer event occurred.
Day of Study: The day number of the study (0-13).
Correct Answer Entropy: The Shannon entropy of correct answers for this question type.
Answer Uniqueness: Inverse of the percentage of times that the correct answer to this question was the correct answer for this type of question.
Confidence: Self-reported confidence in the answer (1-5).
Ease of Remembering Answer: Self-reported ease of remembering the answer (1-5).
Difficulty of Others Guessing: Self-reported perceived difficulty of others guessing the answer (1-5).
Answer Type: Recognition or Recall.
Question Type: The question type.
74. Model Coefficients
75. Unique Answers Are Hard to Remember
76. Systematic Response Error Model • A simplified system with two question types, training data, and an observed authentication attempt:
Training data probability distribution: QT1: P(correct) = 0.7; QT2: P(correct) = 0.4.
Observed question-answer response sequence: (1) QT1, correct; (2) QT2, incorrect.
77. We can calculate the probability of observing this sequence of responses given the user:
P(seq | u) = P(correct(QT1) | u) * P(incorrect(QT2) | u) = 0.7 * (1 - 0.4) = 0.42
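The same computation takes only a couple of lines of Python, using the toy probabilities from the simplified system above:

```python
# Per-question-type probability that the user answers correctly (toy values).
p_correct = {"QT1": 0.7, "QT2": 0.4}

# Observed sequence: (question type, was the answer correct?)
seq = [("QT1", True), ("QT2", False)]

p_seq_given_u = 1.0
for qtype, correct in seq:
    p = p_correct[qtype]
    p_seq_given_u *= p if correct else (1.0 - p)

# p_seq_given_u is now 0.7 * (1 - 0.4), i.e. approximately 0.42
```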
78. Adopting an Adversary: we can simplify the calculation of P(seq) by adopting a specific adversary model, û.
79. Location Question With Map