
Tailored, Machine Learning-driven Password Guessing Attacks and Mitigation

Presented by Georg Knabl in Bucharest, Romania, on November 8-9, 2018, at DefCamp #9.

The videos and other presentations can be found on https://def.camp/archive

Published in: Technology

  1. 1. Tailored, Machine Learning-driven Password Guessing Attacks and Mitigation Georg Knabl
  2. 2. Georg Knabl • self-employed IT-Consultant & Software Engineer at • based in Graz, Austria • areas of expertise • machine learning implementations • web development • information security 2
  3. 3. 3
  4. 4. The Problem with Human Passwords
  5. 5. A Human Attack Vector
     • people use password creation schemes
     • types
       • machine-random (&CtAEaCp?b&v"s%)
       • human-general (123456)
       • human-individual (John1970!)
       • human-random (randomly typed, 34ghjk34f3hjkHGFC)
       • What about correct horse battery staple?
     • issues
       • reduced entropy
       • attacker: knowing scheme (+ personal data) => password
       • humans limited in creativity
         → somebody else might have come up with same scheme
         → schemes publicly available in password leaks
  6. 6. Attacking Passwords
  7. 7. Traditional Approaches
     • Hybrid or rule-based: dictionaries, word-mangling rules
     • Markov models: high-probability character sequences
     • Masks: reduce set to typical structures
     • Brute-force: try every possible combination
     • figure label: key space (Dunning, 2016)
     • tool support: hashcat, John the Ripper, PACK, CeWL, CUPP, …
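The gap between masks and brute force on this slide comes down to key-space size. The following minimal sketch computes it for masks written with hashcat's built-in placeholders (`?l`, `?u`, `?d`, `?s`, `?a`). The function name `mask_keyspace` and the two example masks are assumptions made for illustration; literal characters inside a mask are not handled.

```python
# Charset sizes of hashcat's built-in mask placeholders:
# ?l lowercase, ?u uppercase, ?d digits, ?s symbols (incl. space), ?a all four.
CHARSET_SIZES = {"?l": 26, "?u": 26, "?d": 10, "?s": 33, "?a": 95}


def mask_keyspace(mask: str) -> int:
    """Return the number of candidates a placeholder-only mask describes."""
    if len(mask) % 2 != 0:
        raise ValueError("mask must consist of two-character placeholders")
    total = 1
    for i in range(0, len(mask), 2):
        placeholder = mask[i:i + 2]
        if placeholder not in CHARSET_SIZES:
            raise ValueError(f"unsupported placeholder: {placeholder!r}")
        total *= CHARSET_SIZES[placeholder]
    return total


# Brute force over all printable ASCII strings of length 9 ...
brute_force = mask_keyspace("?a" * 9)
# ... versus a mask with the shape of "John1970!" (illustrative choice).
name_year_mask = mask_keyspace("?u?l?l?l?d?d?d?d?s")

print(f"brute force: {brute_force:.3e} candidates")
print(f"mask:        {name_year_mask:.3e} candidates")
```

A mask only helps when the target password actually has that structure, which is why the deck moves on to learned, tailored models.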
  8. 8. Dictionary Sources
     • password leaks: rockyou.txt, exploit.in, …
       • e.g. 123456, 12345, 123456789, password, iloveyou, princess, 1234567, 12345678, abc123, …
     • tailored lists
       • CeWL: web scraping
         • e.g. Analytics, Website, Designs, Webdesign, Rebranding, passionately, simply, Factory, …
       • CUPP: pre-defined questions
         • e.g. smithJohn@*, smithJohn@@, smithJohn_1, smithSmithy, smith_, smith_01, smith_01050, …
  9. 9. Machine-generated Text
  10. 10. Neural Networks
      • analyze huge datasets
      • learn hidden structures
      • reproduce structures on new data
      • supervised learning process: train on data → generate model → use model to analyze/generate
  11. 11. Recurrent Neural Networks (RNN)
      • learn, analyze, reproduce sequences
      • password = sequence of characters
      • password list: next password → \n is just another character
      • figure: (Olah, 2015)
  12. 12. RNN Tokenization
      • vocabulary (index → character): 0 → a, 1 → b, 2 → c, 3 → d, 4 → e, …, 92 → \n
      • source data (training): "abc" → 0, 1, 2
      • target data (generation): "cde" → 2, 3, 4
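The character-to-index mapping on this slide can be sketched as follows. The vocabulary below is an assumption: it is ordered so that a, b, c map to 0, 1, 2 as on the slide, but it is shorter than the 93-entry table the slide implies, whose full contents are not given. `training_pairs` shows the usual next-character arrangement for character-level RNNs; all function names are hypothetical.

```python
import string

# Assumed vocabulary: lowercase first so that a -> 0, b -> 1, c -> 2.
# The newline separates passwords and is treated like any other character.
VOCAB = string.ascii_lowercase + string.ascii_uppercase + string.digits + "\n"
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text):
    """Map each character to its vocabulary index."""
    return [CHAR_TO_ID[ch] for ch in text]


def decode(ids):
    """Map vocabulary indices back to a string."""
    return "".join(VOCAB[i] for i in ids)


def training_pairs(text):
    """Next-character prediction: the target is the input shifted by one."""
    ids = encode(text)
    return ids[:-1], ids[1:]


print(encode("abc"))                    # indices of a, b, c
print(training_pairs("John\njohn\n"))   # two passwords as one character stream
```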
  13. 13. char-rnn
      • RNN predicts character sequences based on training text
      • by Andrej Karpathy (Karpathy, 2015)
      • https://github.com/karpathy/char-rnn
  14. 14. Works of Shakespeare: training → output (Karpathy, 2015)
  15. 15. Linux Source Code: training → output (Karpathy, 2015)
  16. 16. rockyou.txt: training → output
  17. 17. General Human Password Guessing
      • neural networks outperform other methods above 10^10 guesses
      • (almost) infinite number of passwords
      • source: (Melicher et al., 2016)
  18. 18. Exploiting Individual Human Password Schemes: A Machine Learning Approach
  19. 19. Relevance
      • most passwords have individual context
      • individual details publicly available (OSINT)
        • social media → harvester scripts
        • website user tables → leaked database dumps (e.g. exploit.in)
        • …
  20. 20. Tailored Password Lists
      • training → output: John2050, 180374, 09091958, 06031982, 160883, soni, John!, john!, j0hn.5m17h, john.smith, Smith866, asdfghj, John50
  21. 21. Data Protection Compliance
      • EU-GDPR (General Data Protection Regulation)
        • significant fines: up to €20 million or 4% of worldwide annual revenue
        • processing personal data requires consent
        • password lists contain personal information
          → using publicly available leaked data is illegal
      • imbalance
        • info-sec researcher: has to comply & find (less ideal) alternatives
        • attacker: ignores regulations & trains on best available data
  22. 22. Data Protection Compliance
      • compliant solutions to collect data
      • general passwords:
        • use e.g. top-100,000 passwords list → no personal details contained
      • individual details + passwords:
        • compliance based on "public interest"? (GDPR Art. 6 (1) (e))
        • collect consent from users → requires broad access to user data
          a) directly store & relate data until training is finished
             → requires password storage in plaintext (!!!)
          b) only store tokenized password schemes without user relation
             → requires all relatable personal data to be known at password hashing time
  23. 23. Challenges
      • generate password sequences ✓
      • GDPR compliance ?
      • recognize & relate individual structures ?
        • How to relate personal data?
        • same scheme, different character sequences: <first name><year of birth>! yields John1985!, Jane1992!
      • dealing with obfuscations ?
        • e.g. Leetspeak, all upper/lower case: j0hn1985!, JOHN1985!, john1985!
  24. 24. Generating a Dataset Containing Individual Details
      • starting point: any password leak that contains a personal identifier
        • char-rnn requires > 50,000 entries for proper results
        • e.g. exploit.in (797 million credentials): <email address>:<password>
      • collect, match and attach personal details to entries
        • e.g. using social media harvester
  25. 25. Generating a Dataset Containing Individual Details
      • example result:

        Gender  Username   First Name  Last Name  Year of Birth  Password
        f       margarete  Judy        Wells      1972           Wells106
        f       sondra     Lucia       Morrow     1950           cvbnm
        f       zakia      Gale        Weiss      1999           syndikat
        f       eada       Ana         Elliott    1994           Ana94
        f       karalee    Denise      Hanson     1965           OLIVER
        m       agatha     Edmond      Daniels    1956           Agatha
        …
  26. 26. Password Schemes Used
      • Random: random choice of top-X password list (e.g. 123456)
      • Easy to Type: nearby characters on keyboard (e.g. qwerty)
      • Username: use person's username (e.g. smithy)
      • First Name + "!": use person's first name plus exclamation mark (e.g. John!)
      • Lowercased First Name + "!": use person's lowercased first name plus exclamation mark (e.g. john!)
      • Last Name + Random Int: use person's last name plus a three-digit integer at the end (e.g. Smith758)
      • Username Leetspeak: use person's username in Leetspeak (e.g. 5m17hy)
      • First Name + Year of Birth (4 digits): use person's first name plus their year of birth (e.g. John1985)
      • First Name + Year of Birth (2 digits): use person's first name plus their year of birth in two digits (e.g. John85)
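A fake dataset following these schemes could be generated roughly as below. This is a sketch under assumptions, not the talk's generator: the dictionary keys, the short password pools, and the uniform choice between schemes are all illustrative, and the Leetspeak scheme is left out to keep the example short.

```python
import random

# Illustrative pools; the talk uses a top-X password list instead.
TOP_PASSWORDS = ["123456", "12345", "123456789", "password", "iloveyou"]
KEYBOARD_WALKS = ["qwerty", "asdfghj", "cvbnm"]

# Each scheme maps a person record (plus a random source) to a password.
SCHEMES = {
    "random": lambda p, rng: rng.choice(TOP_PASSWORDS),
    "easy_to_type": lambda p, rng: rng.choice(KEYBOARD_WALKS),
    "username": lambda p, rng: p["username"],
    "first_name_bang": lambda p, rng: p["first_name"] + "!",
    "lower_first_name_bang": lambda p, rng: p["first_name"].lower() + "!",
    "last_name_int": lambda p, rng: p["last_name"] + str(rng.randint(100, 999)),
    "first_name_year4": lambda p, rng: p["first_name"] + str(p["year_of_birth"]),
    "first_name_year2": lambda p, rng: p["first_name"] + str(p["year_of_birth"])[-2:],
}


def assign_password(person, rng):
    """Pick one scheme uniformly at random (an assumption) and apply it."""
    name = rng.choice(sorted(SCHEMES))
    return name, SCHEMES[name](person, rng)


person = {"username": "smithy", "first_name": "John",
          "last_name": "Smith", "year_of_birth": 1985}
rng = random.Random(0)  # seeded so repeated runs give the same dataset
for _ in range(3):
    print(assign_password(person, rng))
```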
  27. 27. Tokenization
      • replace personal details with column id
      • column id is just another character
      • problem: exact matching fails to match obfuscations or abbreviations
        • John != j0hn
        • 1986 != 86

        #  First Name  Year of Birth  Password  Resulting Password Tokens
        1  Max         1983           Max1983!  column: First Name, column: Year of Birth, !
        2  John        1986           John86!   column: First Name, 8, 6, !
        3  Max         1987           123456    1, 2, 3, 4, 5, 6
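The exact-match replacement in the table above can be sketched like this. The greedy longest-match strategy, the tuple representation of a column id, and the names `tokenize` and `detokenize` are assumptions; the slides only state that a column id is treated as one more character.

```python
COLUMNS = ["first_name", "year_of_birth"]


def tokenize(password, person):
    """Replace exact occurrences of personal details with column tokens."""
    tokens, i = [], 0
    while i < len(password):
        best = None
        for col in COLUMNS:
            value = str(person[col])
            # Prefer the longest column value that matches at position i.
            if value and password.startswith(value, i):
                if best is None or len(value) > len(str(person[best])):
                    best = col
        if best is None:
            tokens.append(password[i])        # ordinary character
            i += 1
        else:
            tokens.append(("column", best))   # placeholder for a personal detail
            i += len(str(person[best]))
    return tokens


def detokenize(tokens, person):
    """Fill column tokens with another person's details."""
    return "".join(str(person[t[1]]) if isinstance(t, tuple) else t for t in tokens)


max_1983 = {"first_name": "Max", "year_of_birth": 1983}
john_1986 = {"first_name": "John", "year_of_birth": 1986}

scheme = tokenize("Max1983!", max_1983)
print(scheme)
print(detokenize(scheme, john_1986))
print(tokenize("John86!", john_1986))  # "86" stays literal: exact matching misses it
```

The last call reproduces the problem named on the slide: the two-digit year is not recognized, so the scheme is only partially captured.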
  28. 28. Support Matching Using Data Variations
      • add on-the-fly word mangling rules to columns
        • Leetspeak
        • lowercase
        • uppercase
        • …
      • example: the entry (f, tania, Kara, Rosales, …) is extended to

        Column      Original  Leetspeak  Lowercase  Uppercase
        Gender      f         f          f          F
        Username    tania     74n14      tania      TANIA
        First Name  Kara      k4r4       kara       KARA
        Last Name   Rosales   r054135    rosales    ROSALES
        …
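The column extension can be sketched as a set of mangling functions applied to every personal detail. The Leetspeak substitution table below is inferred from the examples in the slides (tania → 74n14, Rosales → r054135, john.smith → j0hn.5m17h) and is an assumption, as are the names `leetspeak`, `MANGLERS` and `expand_columns`.

```python
# Substitutions inferred from the slide examples; other Leetspeak
# dialects use different tables.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "l": "1",
                      "o": "0", "s": "5", "t": "7"})


def leetspeak(value):
    """Lowercase the value, then substitute look-alike digits."""
    return value.lower().translate(LEET)


MANGLERS = {
    "original": lambda v: v,
    "leetspeak": leetspeak,
    "lowercase": str.lower,
    "uppercase": str.upper,
}


def expand_columns(person):
    """Return every column in every mangled variant, keyed 'column:rule'."""
    return {
        f"{column}:{rule}": mangle(str(value))
        for column, value in person.items()
        for rule, mangle in MANGLERS.items()
    }


entry = {"gender": "f", "username": "tania",
         "first_name": "Kara", "last_name": "Rosales"}
for key, value in expand_columns(entry).items():
    print(f"{key:24} {value}")
```

Each derived column gets its own id, so an obfuscated password such as k4r4! can still be matched to a personal detail during tokenization.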
  29. 29. Challenges
      • generate password sequences ✓
      • GDPR compliance ✓
        → use top-X password lists + fake rules
      • recognize & relate individual structures ✓
        → column ids instead of individual details
      • dealing with obfuscations ✓
        → on-the-fly word mangling rules to extend columns
  30. 30. Implementation
      • Python application based on Sean Robertson's char-rnn.pytorch
        • https://github.com/spro/char-rnn.pytorch
      • adaptations (excerpt)
        • matrix-based individual detail matching
        • on-the-fly word-mangling rules
  31. 31. Training (sample output per epoch)
      • epoch 10: Whn carickte aanhls cshscarn suasso ail zpkoty beigedl 11883469 aw aeeenl aiseie enal faedni bnoxtln
      • epoch 40: Wh ronis25 44353133 maty 0598971 treames bicken ratont tulie stocker shathos netrer derfa tolei dorled
      • epoch 280: Wh ge butter jackout 05081984 lllllll sian harder chedle raven 11021985 supers 17031988 spike duddick
  32. 32. Attacking the Target
      • collect data about victim & generate dataset

        Gender  Username    First Name  Last Name  Year of Birth
        m       john.smith  John        Smith      2050

      • use trained model to generate a tailored password list
      • quality of list depends heavily on
        • selected training data
        • hyperparameter configuration
  33. 33. Results & Qualitative Analysis
  34. 34. Scheme Adoption
      • target entry:

        Gender  Username    First Name  Last Name  Year of Birth
        m       john.smith  John        Smith      2050

      • generated list, annotated with the scheme each line reflects:
        • John2050: First Name + Year of Birth (4 digits): learned
        • 180374, 09091958, 06031982, 160883, soni: Random: stochastic character generation (mostly human dates)
        • John!: First Name + "!": learned
        • John!: duplicate because of few available rules
        • [skipped until line 14]
        • john!: Lowercased First Name + "!": learned using word mangling
        • [skipped until line 23]
        • j0hn.5m17h: Username Leetspeak: learned using word mangling
        • [skipped until line 30]
        • john.smith: Username: learned
        • [skipped until line 80]
        • Smith866: Last Name + Random Int: partially learned + stochastic generation
        • [skipped until line 85]
        • asdfghj: Easy to Type: learned
        • [skipped until line 514]
        • John50: First Name + Year of Birth (2 digits): partially learned + stochastic generation
        • [...]
  35. 35. Proving Password Scheme Adoption
      1. use new fake dataset with same schemes
      2. loop through each entry and generate an individual password list (1000 entries)
      3. check if password is on that list

        Gender  Username   First Name  Last Name  Year of Birth  Password
        f       margarete  Judy        Wells      1972           Wells106 ?
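The three steps above amount to a hit-rate measurement, sketched below. The trained model is not available here, so `toy_generator` is a hypothetical stand-in that emits a few fixed schemes; `hit_rate` and its parameters are likewise illustrative names.

```python
def hit_rate(entries, generate_candidates, list_size=1000):
    """Fraction of entries whose real password is on their tailored list."""
    if not entries:
        return 0.0
    hits = 0
    for person, password in entries:
        # Step 2: generate an individual password list for this entry.
        candidates = generate_candidates(person, list_size)[:list_size]
        # Step 3: check whether the real password is on that list.
        if password in candidates:
            hits += 1
    return hits / len(entries)


def toy_generator(person, n):
    """Hypothetical stand-in for sampling from the trained model."""
    first, year = person["first_name"], str(person["year_of_birth"])
    guesses = [first + year, first + year[-2:], first + "!", first.lower() + "!"]
    return guesses[:n]


# Step 1: a (tiny) fake dataset, using two rows from the example table.
fake_entries = [
    ({"first_name": "Ana", "last_name": "Elliott", "year_of_birth": 1994}, "Ana94"),
    ({"first_name": "Judy", "last_name": "Wells", "year_of_birth": 1972}, "Wells106"),
]

print(hit_rate(fake_entries, toy_generator))
print(hit_rate(fake_entries, toy_generator, list_size=1))
```

Varying `list_size` shows how early in the list the matches occur, which is the quantity the following results slide reports.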
  36. 36. Results
      • 6 models with different configurations
      • all models match about 70% of passwords within lists of only ~100 lines
      • optimized configurations increase matching efficiency
      • recreated distributions of schemes
  37. 37. Mitigation
  38. 38. Mitigation Strategies
      • generate own model and check user's password against generated lists
        • attacker's model and dataset not available → password lists will differ
      • long or complex passwords
        • passwords might still be guessed if they contain personal information
        • e.g. JohnSmith1985 is actually <column: firstname><column: lastname><column: year of birth>
      • treating all human-like passwords as insecure
        • requires classification of human-likeness
  39. 39. Human Password Classification
      • using machine learning to classify human-likeness
      • dataset (80k human + 80k machine labeled passwords)
      • classifiers
        • Logistic Regression
        • Multinomial Naïve Bayes
        • Linear Support Vector Machine
        • Random Forest
      • vectorizers
        • TF-IDF
        • Count
      • labeled samples (h = human, m = machine):

        &CtAEaCp?b&v"s%  m
        -SUuf4TLtF       m
        mallrats         h
        bP0.}BO/L&{:     m
        ^=c.rgH$z        m
        boxers           h
        j&uzHCutff_A{    m
        656565           h
        6>IB|~@4^n}K     m
        forever1         h
        …
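To illustrate one classifier and vectorizer pairing from this slide, here is a dependency-free sketch of multinomial Naïve Bayes over character bigram counts. It is not the author's classifier: the bigram choice, the "^"/"$" boundary padding, the smoothing value, and the class name are assumptions, and the ten slide samples stand in for the 160k-password dataset.

```python
import math
from collections import Counter


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes on character n-gram counts (count vectorizer)."""

    def __init__(self, n=2, alpha=1.0):
        self.n, self.alpha = n, alpha

    def _ngrams(self, text):
        padded = "^" + text + "$"  # mark start and end (assumed convention)
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def fit(self, passwords, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        for password, label in zip(passwords, labels):
            self.counts[label].update(self._ngrams(password))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, password):
        total = sum(self.priors.values())
        log_scores = {}
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.priors[label] / total)
            for gram in self._ngrams(password):
                if gram in self.vocab:  # unseen n-grams are ignored
                    score += math.log((counts[gram] + self.alpha) / denom)
            log_scores[label] = score
        top = max(log_scores.values())  # subtract max for numerical stability
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def predict(self, password):
        proba = self.predict_proba(password)
        return max(proba, key=proba.get)


samples = [
    ('&CtAEaCp?b&v"s%', "m"), ("-SUuf4TLtF", "m"), ("mallrats", "h"),
    ("bP0.}BO/L&{:", "m"), ("^=c.rgH$z", "m"), ("boxers", "h"),
    ("j&uzHCutff_A{", "m"), ("656565", "h"), ("6>IB|~@4^n}K", "m"),
    ("forever1", "h"),
]
model = CharNgramNaiveBayes().fit([p for p, _ in samples], [l for _, l in samples])
print(model.predict_proba("mallrats"))
```

With only ten training samples the scores for unseen passwords are not meaningful; the sketch shows the mechanics, not the reported accuracy.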
  40. 40. Results
      • accuracy human vs. machine-random: 99% correct
      • example outputs (password, score):

        14061966       0.9961306540
        y-JQ6{v;_yb|q  0.0000000000
        ZBT4n#z-x      0.0000121259
        longball       0.9920406811
        vikings        0.9723564484
        gunit          0.9683620674
        .XP?]b36nP]l|  0.0000000000
        8J9{Bd^        0.0000107884
        123india       0.9986476258
        *[qg;t         0.0000058089
        …
  41. 41. What about randomly-typed passwords?
      • human-random passwords
        • almost impossible for humans to distinguish
        • previously trained model: 83% correct
        • specifically trained model (human-random vs. machine-random): 94% correct
      • examples:

        ,asgl213
        HGHfwjiofjiw!?
        FEA452
        dciuowed7983zy_
        jksdgf644kjbndf
        Xkkeelt7tad5z
        sabjas012
        123jfmvfkfn49fvk.
        …
  42. 42. Demo
  43. 43. Conclusion
      • machine learning can be used to efficiently attack passwords created by humans
      • mitigation
        • treat human passwords as insecure
        • warn users or provide password policy
          → use machine learning model to identify human passwords
          → integrate on web servers & password storage services
  44. 44. Resources
      • Thesis "Machine Learning-driven Password Lists":
        • https://www.researchgate.net/publication/328719001_Machine_Learning-driven_Password_Lists
      • Human Password Classifier:
        • https://github.com/georgknabl/human-password-classifier
      • ready-to-use trained models available via e-mail
  45. 45. "The only secure password is the one you can't remember." Troy Hunt (haveibeenpwned.com)
  46. 46. Contact
      • DI (FH) Georg Knabl, MSc
      • IT-Consultant & Software Engineer
      • georg.knabl@pageonstage.at
  47. 47. Sources
      • Dunning, Julian (2016). Statistics Will Crack Your Password. Available from: https://p16.praetorian.com/blog/statistics-will-crack-yourpassword-mask-structure [Mar. 3, 2018]
      • Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. Available from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ [Nov. 10, 2017]
      • Melicher, William, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor (2016). "Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks". In: 25th USENIX Security Symposium (USENIX Security 16). Vancouver: USENIX Association, pp. 175–191.
      • Olah, Christopher (2015). Understanding LSTM Networks. Available from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [Nov. 10, 2017]
