Making Logs Sexy Again: Can We Finally Lose The Regexes?


Making Logs Sexy Again: Can We Finally Lose The Regexes?, presented at DeepSec 2008 in Vienna, Austria

Published in: Technology
  • “Will be high-level, but the subject is definitely pretty DEEP… so DeepSec.” Making Logs Sexy Again: Can We Finally Lose The Regexes? A few years ago, a conference organizer complained to me that nothing new had happened in automated log analysis for many years, and that we had not progressed much since regular expressions (regexes) were first used to extract bits of useful information from logs. This talk is intended to change that! It goes through a set of approaches to the “log problem”: some completely new (text mining of logs), some borrowed from other fields (Bayesian methods for log analysis), and shows how people can use them to solve practical problems. Only real-world-tested analysis methods that the audience can actually use will be shown (and, of course, only the “technically cool” and novel ones!). Example questions that will be answered: Are logs data or text? Which data mining methods work really well on logs while still being simple enough for non-Ph.D.s? Can you make sense of all the keywords and key phrases in logs? Does Bayes work for logs? Can you predict when the system will crash based on logs?
    1. Making Logs Sexy Again: Can We Finally Lose The Regexes? Dr. Anton Chuvakin
    2. Agenda
       • What do we do with logs?
       • What’s wrong with logs?
         • Logging “Grand Challenges”
       • What has been tried to make it right?
       • What’s next?
       C O N F I D E N T I A L
    3. Who is Anton?
       • Dr. Anton Chuvakin [formerly] from LogLogic “is probably the number one authority on system logging in the world”
       • SANS Institute (2008)
    4. First: WTH is Logs?
       • I. A structured and timed audit trail of system, network activities?
       • II. A huge messy pile of undocumented, unstructured or poorly structured “stuff” that might or might not mean anything – at all?
    5. WTF?!!
       • %PIX|ASA-3-713185 Error: Username too long - connection aborted
       • ERROR: transport error 202: send failed: Success
       • userenv[error] 1030 RCI-CORPwsupx No description available
       • Aug 11 09:11:19 xx null pif ? exit! 0
    6. Second: What Do We Do With Logs?
       • Ignore them
         • By far the #1 popular choice!
       • Collect them – and then ignore them
       • Collect them to show them to whoever wants them – and then nobody does
       • Collect and look for “bad” stuff
       • Collect and parse them into a database – ignoring logs now costs more!
    7. Analysis?
    8. Log Analysis Basics: Summary
       • Manual
       • Filtering
       • Summarization and reports
       • Log searching
       • Correlation
    9. Log Analysis Basics: Manual
       • Manual log review
         • Just fire up your trusty ‘tail’, ‘more’, ‘notepad’, ‘vi’, Event Viewer, etc. and get to it!
       • Pros:
         • Easy, no tools required (neither build nor buy)
       • Cons:
         • Try it with a 10GB log file one day
         • Boring as Hell!
    10. Log Analysis Basics: Filtering
       • Log Filtering
         • Just show me the bad stuff; here is the list (positive)
         • Just ignore the good stuff; here is the list (negative or “Artificial Ignorance”)
       • Pros:
         • Easy result interpretation: see->act
         • Many tools or write your own
       • Cons:
         • Patterns beyond single messages?
         • Neither good nor bad, but interesting?
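Editor's sketch: the two filtering modes on the slide above could look roughly like this in Python. The pattern lists and the last two sample lines are made up for illustration (the first sample mirrors the sshd line on slide 17, with a placeholder host); this is not the speaker's tooling.

```python
import re

# Hypothetical pattern lists; a real deployment would tune these per site.
BAD_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"fail", r"denied", r"error")]
KNOWN_GOOD_PATTERNS = [
    re.compile(p) for p in (r"Accepted password for \w+", r"session (opened|closed)")
]


def positive_filter(lines):
    """Positive filtering: keep only lines that match a known-bad pattern."""
    return [ln for ln in lines if any(p.search(ln) for p in BAD_PATTERNS)]


def artificial_ignorance(lines):
    """Negative filtering: drop lines that match a known-good pattern, keep the rest."""
    return [ln for ln in lines if not any(p.search(ln) for p in KNOWN_GOOD_PATTERNS)]


# Illustrative sample; the "Failed password" and "kernel" lines are invented.
SAMPLE = [
    "sshd[27577]: Accepted password for kyle from host port 2895 ssh2",
    "sshd[27580]: Failed password for root from host port 2901 ssh2",
    "ERROR: transport error 202: send failed: Success",
    "kernel: eth0: link up",
]

if __name__ == "__main__":
    print(positive_filter(SAMPLE))
    print(artificial_ignorance(SAMPLE))
```

Note that the `kernel` line survives the negative filter but not the positive one: it is "neither good nor bad", which is exactly the gap the slide lists under Cons.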
    11. Log Analysis Basics: Summary
       • Summarization and reports
         • Top X Users, Connections by IP
       • Pros:
         • Dramatically reduces the size of data
         • Suitable for high-level reporting
       • Cons:
         • Loss of information by summarizing
         • Which report to pick for a task?
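Editor's sketch: a "Top X" report is essentially a count by one field. The example below assumes events have already been parsed into dictionaries; the field names and IP addresses are hypothetical.

```python
from collections import Counter


def top_n(events, field, n=3):
    """Count parsed events by one field and return the n most common values."""
    return Counter(e[field] for e in events if field in e).most_common(n)


# Hypothetical parsed events; field names and addresses are illustrative only.
EVENTS = [
    {"user": "kyle", "src_ip": "10.0.0.5"},
    {"user": "kyle", "src_ip": "10.0.0.5"},
    {"user": "yellowdog", "src_ip": "10.0.0.9"},
    {"user": "kyle", "src_ip": "10.0.0.7"},
]

if __name__ == "__main__":
    print(top_n(EVENTS, "user", 2))
    print(top_n(EVENTS, "src_ip", 1))
```

The report is small and readable, but the individual events are gone from it, which is the "loss of information" the slide warns about.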
    12. Log Analysis Basics: Search
       • Search across stored logs
       • User specifies a time period, log source(s), and an expression; gets back logs that match (regex or index keywords)
       • Pro
         • Easy to understand
         • Quick to do
       • Con
         • What do you search for?
         • A LOT of data back, sometimes
    13. Log Analysis Basics: Correlation
       • Correlation
         • Rule-based and other “correlation” and “Correlation” algorithms
       • Pro
         • Promise of automated analysis
       • Con
         • Needs rules written by experts
         • Often, needs to be operated by experts too
         • Needs tuning for each site
    14. Happy Now?
       • Most People Aren’t
    15. Log Management “Grand Challenges?”
       • Given “the state of the art” of log analysis, major problems remain.
       • Logging Grand Challenges
       • BIG, unsolved log management problems that cause major pain!
       • Problems that people tried to solve – and FAILED!
    16. GC1 – Log Chaos
       • Challenge
         • Logs come in a dizzying variety of formats; they look different and mean different things – how do we understand them? Some of them are just “bad!”
       • Why a grand challenge?
         • Lack of log standards makes log analysis an unreliable and complex art
       • Current approaches?
         • Take logs one by one; write regexes or index
       • Why still a challenge?
         • No credible log standard has emerged (work ongoing)
    17. Example Log Chaos - Login?
       • <122> Mar 4 09:23:15 localhost sshd[27577]: Accepted password for kyle from ::ffff: port 2895 ssh2
       • <13> Fri Mar 17 14:29:38 2006 680 Security SYSTEM User Failure Audit ENTERPRISE Account Logon Logon attempt by: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon account: POWERUSER
       • <57> Dec 25 00:04:32:%SEC_LOGIN-5-LOGIN_SUCCESS: Login Success [user:yellowdog] [Source:] [localport:23] at 20:55:40 UTC Fri Feb 28 2006
       • <18> Dec 17 15:45:57 ns5xp: NetScreen device_id=ns5xp system-warning-00515: Admin User netscreen has logged on via Telnet from (2002-12-17 15:50:53)
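Editor's sketch: the "take logs one by one; write regexes" approach from GC1, applied to three of the login messages above. The regexes and the normalized field names (`event`, `source`, `user`) are assumptions written for this illustration, not a vendor-supplied parser. The sample lines are copied from the slide as they appear, including the stripped addresses.

```python
import re

# One hand-written regex per log format; each is an illustrative guess
# based only on the sample lines on the slide.
LOGIN_PATTERNS = {
    "sshd": re.compile(r"sshd\[\d+\]: Accepted \w+ for (?P<user>\S+)"),
    "cisco": re.compile(r"%SEC_LOGIN-5-LOGIN_SUCCESS: Login Success \[user:(?P<user>[^\]]+)\]"),
    "netscreen": re.compile(r"Admin User (?P<user>\S+) has logged on via \w+"),
}


def normalize_login(line):
    """Return a normalized login-success record, or None if no format matches."""
    for source, pattern in LOGIN_PATTERNS.items():
        match = pattern.search(line)
        if match:
            return {"event": "login_success", "source": source, "user": match.group("user")}
    return None


LINES = [
    "<122> Mar 4 09:23:15 localhost sshd[27577]: Accepted password for kyle from ::ffff: port 2895 ssh2",
    "<57> Dec 25 00:04:32:%SEC_LOGIN-5-LOGIN_SUCCESS: Login Success [user:yellowdog] "
    "[Source:] [localport:23] at 20:55:40 UTC Fri Feb 28 2006",
    "<18> Dec 17 15:45:57 ns5xp: NetScreen device_id=ns5xp system-warning-00515: "
    "Admin User netscreen has logged on via Telnet from (2002-12-17 15:50:53)",
]

if __name__ == "__main__":
    for line in LINES:
        print(normalize_login(line))
```

Three formats already need three separate patterns, and any message that matches none of them falls through silently. That is the scaling problem GC3 describes.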
    18. GC2 – Secure and Reliable Log Collection
       • Challenge
         • To collect the logs securely, reliably AND without heavy management overhead and complexity of access
       • Why a grand challenge?
         • Agents vs remote grabbing vs stream: all suck. Security and reliability cost major management overhead
       • Current approaches?
         • Agents + remote grab (administrator access) + “silly stream” (syslog UDP)
       • Why still a challenge?
         • All approaches have critical drawbacks
    19. GC3 - Log Parsing and Regexes
       • Challenge
         • To turn logs into information, one needs to parse them; to parse them one needs regular expressions (regex)
       • Why a grand challenge?
         • Every log type requires hand-writing a set of regexes
       • Current approaches?
         • UIs, “semi-auto”/assisted regex creators, limited auto-extraction, choosing not to parse, etc.
       • Why still a challenge?
         • Despite all the tools, a log expert must create the rules
    20. GC4 – Automated Meaning Extraction
       • Challenge
         • Automatically analyze logs and gain useful information, across domains (security, ops, compliance)
       • Why a grand challenge?
         • Log analysis is heavily manual, interpretative and domain- and system-specific
       • Current approaches?
         • Rule-based, summarization, filtering, minimal anomaly detection
       • Why still a challenge?
         • “Log analysis is an art, not science” -> not much automation
    21. GC5 – “Fuzzy” Search
       • Challenge
         • How to find the “right” log message(s) without knowing exactly what to look for?
       • Why a grand challenge?
         • Many uses of logs require searching, but users often don’t know what to look for
       • Current approaches?
         • Trying keywords + wildcards + refining search as we go
       • Why still a challenge?
         • No method for incorporating uncertainty into search has been found yet
    22. What Else Can We Try?
    23. Handling The Challenges
       • Data mining
       • “Search+” or smart search
       • Text mining (or “search++”)
       • Bayesian analysis
       • Context enrichment
       • Visualization
    24. Log Mining
       • Why “mine the logs”?
         • More human-like pattern recognition
         • Dealing with sparse data
         • Prediction? Probably not (not soon!)
       • Towards “replacing” humans (trying to…)
         • Offloading conclusion generation to machines
         • “Better than junior analysts”
    25. Preliminary DATA Requirements
       • Mostly the same as for other log analysis, but with some added factors:
       • Centralized
         • To look in just one place
       • Normalized
         • To look across the data sources
       • Quickly accessible storage
         • To be used by the mining tools
    26. What Do We “Mine” for?
       • How about for something interesting?
       • One research paper defines “interesting” thus:
         • Unexpected to user (aka not “normal”, not routine)
         • Actionable (we can and/or should do something about it)
    27. Simple Example
       • Too many attack types from a single IP address
       • Right next to known vulnerability scanners
       • External IP address
       • Conclusion: potentially dangerous attacker
    28. Deeper into interesting - I
       • Approaches to finding interesting stuff in logs without knowing what we look for specifically:
       • Rare things
         • Is compromise rare in your environment?
       • Different things (NEW, GONE, etc.)
         • Is today “just another day” … or not?
       • “Out of character” things
         • It always does it… but not today?
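Editor's sketch: "rare" and "NEW/GONE" from the slide above reduce to counting and set differences once there is something to compare, such as process names or message types seen per day. The item names and the `max_count` threshold below are hypothetical.

```python
from collections import Counter


def new_and_gone(baseline, today):
    """Return (new, gone): items seen only today, and items seen only in the baseline."""
    base, now = set(baseline), set(today)
    return sorted(now - base), sorted(base - now)


def rare(items, max_count=1):
    """Return items that occur at most max_count times (an assumed rarity threshold)."""
    counts = Counter(items)
    return sorted(k for k, c in counts.items() if c <= max_count)


# Hypothetical process names extracted from two days of logs.
BASELINE = ["sshd", "cron", "httpd", "cron"]
TODAY = ["sshd", "cron", "nc", "cron", "cron"]

if __name__ == "__main__":
    print(new_and_gone(BASELINE, TODAY))
    print(rare(TODAY))
```

Neither function needs to know in advance what "bad" looks like, which is the point of the slide: the comparison itself surfaces candidates for a human to review.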
    29. Can You Guess What Happened?!
    30. Deeper into interesting - II
       • Counts of an otherwise uninteresting thing
         • Pings? Connections to port 80? Error 404s?
       • Ratios of otherwise uninteresting things
         • Login failures / login successes?
         • Inbound / outbound connections?
       • Frequencies of things
         • Frequent becoming rare – and vice versa!
       • Time series behaving badly
         • Traffic overall grows, but traffic vs system X slows
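Editor's sketch: counts and ratios become useful once they are compared against history. The example below uses the failure share (failures divided by all login attempts) rather than the literal failures/successes ratio on the slide, to avoid dividing by zero, and flags a daily count that sits far from its historical mean. The three-standard-deviation threshold and the sample numbers are assumptions.

```python
import statistics


def failure_share(failures, successes):
    """Fraction of login attempts that failed; 0.0 when there were no attempts."""
    total = failures + successes
    return failures / total if total else 0.0


def is_out_of_character(history, today, threshold=3.0):
    """Flag today's count if it is more than `threshold` population standard
    deviations away from the mean of the historical counts."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is out of character.
        return today != mean
    return abs(today - mean) / spread > threshold


# Hypothetical daily counts of HTTP 404 errors.
HISTORY = [100, 110, 90, 105, 95]

if __name__ == "__main__":
    print(failure_share(5, 95))
    print(is_out_of_character(HISTORY, 400))
    print(is_out_of_character(HISTORY, 104))
```

A fixed threshold like this is crude; it ignores weekly cycles and trends, so it illustrates the idea rather than a production baseline.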
    31. Where Is The DATA?
       • But are logs really data? Looks like [broken] English to me…
       • %PIX-2-214001: Terminating manager session from on interface inside. Reason: incoming encrypted data (18998 bytes) longer than 12453 bytes
       • %PIX-3-109016: Downloaded authorization access-list 101 not found for user sunilp
       • So, is DATA mining appropriate?
    32. “Search+”
       • Workflows to tune searches
       • Advanced search syntax
       • Search data presentation and visualization
       • Statistics on search results
       • Automated text -> data conversion (assisted “parsing”)
    33. Text Mining Logs?
       • Text clustering
       • Baselining and profiling of text streams
       • Bayesian mining
       • Example: textalog tool
       • New/gone keywords, keyword pairs or phrases
       • Changes in keyword frequency and mixture
       • New/gone text clusters
    34. Keywords to Meaning?
       • E.g. Change Management activity
         • Keywords: change*, modif*, add*, delete*, drop, remove*, creat*, restore*, set, clear*, enable*, install*, write*, rename*, alter*, truncate*, renam*, update*, erase*
         • Keyword pairs: policy + change*, object + change*, file + write*, account + adde*, audit* + change*, creat* + (user OR account), new + user OR account, change OR modify + restart, config* + erase*, user + add*, config* + writ*, config* + chang*, account + change*
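Editor's sketch: one way to apply wildcard keywords and keyword pairs like those above to individual messages. Only a few keywords and pairs from the slide are used; treating `*` as "any word characters after this prefix" and requiring both halves of a pair in the same message are assumptions, and the test messages are made up.

```python
import re

# A subset of the slide's keywords and pairs, for illustration.
KEYWORDS = ["change*", "modif*", "delete*", "creat*", "install*"]
KEYWORD_PAIRS = [("policy", "change*"), ("file", "write*"), ("user", "add*"), ("config*", "erase*")]


def to_regex(pattern):
    """Turn a slide-style keyword ('creat*' or 'set') into a case-insensitive word regex."""
    if pattern.endswith("*"):
        return re.compile(r"\b" + re.escape(pattern[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)


def matched_keywords(message):
    """Return the keywords that occur somewhere in the message."""
    return [k for k in KEYWORDS if to_regex(k).search(message)]


def matched_pairs(message):
    """Return the pairs whose two keywords both occur in the message."""
    return [
        (a, b)
        for a, b in KEYWORD_PAIRS
        if to_regex(a).search(message) and to_regex(b).search(message)
    ]


if __name__ == "__main__":
    for msg in ("Policy changed by admin", "User account added: kyle"):
        print(msg, matched_keywords(msg), matched_pairs(msg))
```

Prefix matching is deliberately loose (`creat*` also matches "creature"), so a hit is a hint that a message may describe change-management activity, not proof.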
    35. Log Visualization
       • Log Data -> Pictures
         • But can you do “anything but the scan?”
       • Data visualization is MUCH easier than text visualization
       • Example: AfterGlow tool
    36. Log Context Enrichment?
    37. Finally, Can We Make Better Logs? CEE = Syntax + Vocabulary + Transport + Log Recommendations
       • Common Event Expression Impacts
         • Log management capabilities
         • Log correlation (SIEM) capabilities
         • Device intercommunication enabling autonomic computing
         • Enterprise-level situational awareness
         • Infosec attacker modeling and other security analysis capability
       • Common Event Expression Taxonomy
         • To specify the event in a common representation
       • Common Log Syntax
         • For parsing out relevant data from received log messages
       • Common Log Transport
         • For exchanging log messages
       • Log Recommendations
         • For guiding events and details needed to be logged by devices (OS, IDS, FWs, etc.)
    38. Conclusions and CALL TO ACTION!
       • As we collect more logs, the issue of “what to do with them?” will come up in force, FINALLY!
       • Current methods of turning logs into useful info mostly suck (or are “waaaaaaaaaay too hard”)
       • Promising other methods are KNOWN!
       • What more do you need?
       • Get to work on these big logging problems!
    39. Thank You!
       • Dr. Anton Chuvakin
       • “Log Addict”
       • See for my papers, books, reviews and other security and logging resources.
       • Subscribe to my blog at
    40. Backup and Reference Slides
    41. “Top 11 Reasons People Hate Logs”
       • Read any logs lately? Got bored in 5 minutes - or survived for the whopping 10? Congrats, you score a point! But logs are still boooooooooooooooooooooooooooooring.
       • One log, two logs, 10 logs.... 1,000,000,000 logs: rabbits and hamsters cannot match the speed with which logs multiply. Don't you just hate that?
       • You keep hearing people refer to "log data." Then you run 'tail /var/log/messages' and see text in pidgin English. Where is my data? Hate it!
       • "Real hackers don't get logged": thus logs are seen as useless - and hated by some "hard core" security pros!
       • If people lie to you, you hate it. Logs do lie too (see 'false positives') - and they are hated too.
       • 'Transport error 202 message repeated 3456 times.' Niiiiice. Now go fix that! Fix what? Ah, hate the log obscurity!
       • Why are there 47 different ways to log that "connection from A to B was established OK?" Or 21 ways to say "user logged in OK?" No, really? Why? Who can I kill to stop this insanity?
       • You MUST do XYZ with logs for compliance. Or you are going to jail, buddy! No, sorry, we can't tell you what XYZ is. Maybe in 7 years; for now, just store everything.
       • 'Critical error: process completed successfully' and 'Operation successfully failed' engender deep and lasting hatred of logs in most people. They just do ...
       • The book called "Ugliest Logs Ever!" is a fat tome, covering every log source from a Linux system all the way to databases and CRM. Bad logs are popular! Bad logs are all the rage among the programmers! Bad logs are here to stay. Bad logs that mean nothing power the log hatred.
       • "Logs: can't live with them, can't live without them" :-) Hate them we might for different reasons, but we still must collect, protect, review, and analyze them ...