Log Mining: Beyond Log Analysis


This presentation describes methods for discovering interesting and actionable patterns in log files for security management without knowing in advance exactly what you are looking for. This approach differs from "classic" log analysis, and it gives insight into insider attacks and other advanced intrusions that are extremely hard to discover with other methods. Specifically, I will demonstrate how data mining can be used as a source of ideas for designing future log analysis techniques that will help uncover upcoming threats. An important part of the presentation is a demonstration of how these methods worked in a real-life environment.

Published in: Technology

  1. 1. Security Log Mining Beyond Log Analysis Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA Security Log Mining Last presented on March 9, 2007 IT Underground Prague, Czech Republic
  2. 2. Goals <ul><li>Learn or refresh your knowledge about log analysis for security </li></ul><ul><li>Learn about novel techniques of log analysis via data mining </li></ul><ul><li>Get you to think of using them in your environment </li></ul>
  3. 3. Outline: Log Mining (LM) <ul><li>Logs and Log Analysis Overview </li></ul><ul><ul><li>What logs?  </li></ul></ul><ul><ul><li>Why analyze logs? </li></ul></ul><ul><ul><li>Why NOT analyze logs?  </li></ul></ul><ul><ul><li>How people usually do it </li></ul></ul><ul><li>Log Mining </li></ul><ul><ul><li>Knowledge discovery and data mining brief </li></ul></ul><ul><ul><li>Mining of different types of logs </li></ul></ul><ul><li>Results </li></ul><ul><ul><li>Examples of using the above methods </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>How one can build tools to do it </li></ul></ul>
  4. 4. Definitions <ul><li>Log = a record of any activity occurring on an information system </li></ul><ul><li>Also: alert, “event”, alarm, message, record, etc. </li></ul><ul><li>… standard definitions are coming soon! </li></ul>
  5. 5. Log Analysis: What <ul><li>Log Data Sources </li></ul><ul><ul><li>IDS </li></ul></ul><ul><ul><li>Firewalls/IPS </li></ul></ul><ul><ul><li>Anti-malware </li></ul></ul><ul><ul><li>Proxies </li></ul></ul><ul><ul><li>Network infrastructure </li></ul></ul><ul><ul><li>Servers </li></ul></ul><ul><ul><li>Databases </li></ul></ul><ul><ul><li>Applications </li></ul></ul><ul><li>Log Analysis Process </li></ul><ul><ul><li>Generate </li></ul></ul><ul><ul><li>Collect </li></ul></ul><ul><ul><li>Aggregate </li></ul></ul><ul><ul><li>Normalize </li></ul></ul><ul><ul><li>Alert </li></ul></ul><ul><ul><li>Store </li></ul></ul><ul><ul><li>Summarize, baseline </li></ul></ul><ul><ul><li>Make conclusions </li></ul></ul><ul><ul><li>Act on them! </li></ul></ul>
  6. 6. Log Analysis: Why <ul><li>Situational awareness and new threat discovery </li></ul><ul><ul><li>Unique perspective from combined logs </li></ul></ul><ul><li>Getting more value out of the network and security infrastructures </li></ul><ul><ul><li>Get more than you paid for! </li></ul></ul><ul><li>Extracting what is really actionable automatically </li></ul><ul><li>Measuring security (metrics, trends, etc.) </li></ul><ul><li>Compliance and regulations (oh, my!) </li></ul><ul><li>Incident response (last, but not least!) </li></ul>
  7. 7. Log Analysis: Why NOT or Log Analysis Challenges <ul><li>“Real hackers don’t get logged!” </li></ul><ul><li>Why bother? No, really… </li></ul><ul><li>Too much data (>X0 GB per day) </li></ul><ul><li>Too hard to do </li></ul><ul><li>No tools “that do it for you” </li></ul><ul><ul><li>Or: tools too expensive </li></ul></ul><ul><li>What logs? We turned them off </li></ul>
  8. 8. Log Analysis Basics: How <ul><li>Common approaches to the “log problem”: </li></ul><ul><li>Manual </li></ul><ul><ul><li>‘tail’, ‘more’, etc. </li></ul></ul><ul><li>Filtering </li></ul><ul><ul><li>Positive and negative (“Artificial ignorance”) </li></ul></ul><ul><li>Summarization and reports </li></ul><ul><li>Simple visualization </li></ul><ul><ul><li>“… worth a thousand words?” </li></ul></ul><ul><li>Correlation </li></ul><ul><ul><li>Rule-based and other </li></ul></ul>
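
The negative filtering ("artificial ignorance") approach named on this slide can be sketched in a few lines of Python. This is only an illustration: the ignore patterns and the sample lines are made up for the example and are not taken from the presentation or from any real product's log format.

```python
import re

# Hypothetical patterns for lines we have already decided are routine.
# A real list would be built up and tuned per site over time.
IGNORE_PATTERNS = [
    re.compile(r"heartbeat ok"),
    re.compile(r"backup finished status=ok"),
]

def artificial_ignorance(lines, ignore=IGNORE_PATTERNS):
    """Return only the lines that match none of the known-routine patterns."""
    return [line for line in lines if not any(p.search(line) for p in ignore)]

# Synthetic sample lines, for illustration only.
sample = [
    "node=a heartbeat ok",
    "job=nightly backup finished status=ok",
    "user bob added to group admins",
]

leftover = artificial_ignorance(sample)
print(leftover)
```

Whatever is left after the routine lines are thrown away is what a human gets to look at; each reviewed and harmless line can then be turned into a new ignore pattern.
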
  9. 9. Log Analysis Basics: When <ul><li>Timing requirements for analysis </li></ul><ul><li>Real-time fallacy: “we have to have it when?” </li></ul><ul><ul><li>“A day later vs. never” question </li></ul></ul><ul><ul><ul><li>Would you rather catch an intrusion a day after … or a month after … CNN talks about it </li></ul></ul></ul><ul><ul><li>Daily in-depth analysis </li></ul></ul><ul><li>Log management vs. alert management: different challenges </li></ul><ul><ul><li>When filtering and event correlation are not enough </li></ul></ul><ul><li>Some data just doesn’t mean much in real-time </li></ul>
  10. 10. KDD and DM <ul><li>Introducing data mining… </li></ul><ul><li>Definitions and background terms: </li></ul><ul><ul><li>Data Mining (DM) and Knowledge Discovery in Databases (KDD) </li></ul></ul><ul><li>DM = “Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data” </li></ul>
  11. 11. Brief on Some DM Techniques <ul><li>From DM to LM: </li></ul><ul><li>Deviation analysis </li></ul><ul><ul><li>Baselines and deviations </li></ul></ul><ul><li>Classification </li></ul><ul><ul><li>Organize data by class to know it </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>How things are grouped together </li></ul></ul><ul><li>Association Rule Discovery </li></ul><ul><ul><li>Relationship finding </li></ul></ul><ul><li>Outlier Detection </li></ul><ul><ul><li>What stands out </li></ul></ul>
  12. 12. KDL and LM <ul><li>Log Mining (LM) and Knowledge Discovery in Logs (KDL) </li></ul><ul><li>Is “log mining” a marketing buzzword? Not yet! </li></ul><ul><li>Why “mine the logs”? </li></ul><ul><ul><li>New types of analysis </li></ul></ul><ul><ul><li>More human-like pattern recognition </li></ul></ul><ul><ul><li>Prediction? Probably not! </li></ul></ul><ul><ul><li>Dealing with sparse data </li></ul></ul><ul><li>Towards “replacing” humans (not really…) </li></ul><ul><ul><li>Offloading conclusion generation to machines </li></ul></ul><ul><ul><li>“Better than junior analysts” </li></ul></ul>
  13. 13. Preliminary Requirements <ul><li>Mostly the same as for simpler log analysis, but with some added factors: </li></ul><ul><li>Centralized </li></ul><ul><ul><li>To look in just one place </li></ul></ul><ul><li>Normalized </li></ul><ul><ul><li>To look across the data sources </li></ul></ul><ul><li>Quickly accessible storage </li></ul><ul><ul><li>To be used by the mining tools </li></ul></ul>
  14. 14. Log Data from DM Perspective <ul><li>Common fields in logs: </li></ul><ul><li>Time </li></ul><ul><li>Source </li></ul><ul><li>Destination </li></ul><ul><li>Protocol </li></ul><ul><li>Port(s) </li></ul><ul><li>User name </li></ul><ul><li>Event/attack type </li></ul><ul><li>Bytes exchanged </li></ul>
  15. 15. Log Data from DM Perspective <ul><li>But are logs really data? Looks like broken English to me… </li></ul><ul><li>%PIX-2-214001: Terminating manager session from on interface inside. Reason: incoming encrypted data (18998 bytes) longer than 12453 bytes </li></ul><ul><li>%PIX-3-109016: Downloaded authorization access-list 101 not found for user sunilp </li></ul><ul><li>Text mining techniques might also come in handy </li></ul>
  16. 16. Example: Jumbled Mess of SAP Application Logs <ul><li>|22:01:40|BTC| 7|000|DDIC | |LC2|Systemerror when executing external command DB6_DATA_COLLECTOR on gneisenau () </li></ul><ul><li>|22:02:32|BTC| 7|000|DDIC | |R49|Communication error, CPIC return code 020, SAP return code 456 </li></ul><ul><li>|22:02:32|BTC| 7|000|DDIC | |R5A|> Conversation ID: 38910614 </li></ul><ul><li>|22:02:32|BTC| 7|000|DDIC | |R64|> CPI-C function: CMSEND(SAP) </li></ul><ul><li>|22:02:32|BTC| 7|000|DDIC | |LC2|Systemerror when executing external command DB6_DATA_COLLECTOR on gneisenau () </li></ul>
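
Even a "jumbled mess" like the lines above becomes minable once it is split into fields. Below is a minimal Python sketch that parses three of the pipe-delimited lines from this slide. The field names (`time`, `proc_type`, `msg_id`, and so on) are guesses made for illustration; the slide does not say what each column means.

```python
from collections import Counter

# Field names are assumptions for illustration, not SAP's documented schema.
FIELDS = ["time", "proc_type", "proc_no", "client", "user", "extra", "msg_id", "text"]

def parse_line(line):
    """Split one pipe-delimited line into a dict, or return None if malformed."""
    # maxsplit=8 keeps any '|' inside the free-text message intact.
    parts = line.strip().split("|", 8)
    if len(parts) != 9 or parts[0] != "":
        return None
    return dict(zip(FIELDS, (p.strip() for p in parts[1:])))

# Lines copied from the slide above.
raw = [
    "|22:01:40|BTC| 7|000|DDIC | |LC2|Systemerror when executing external command DB6_DATA_COLLECTOR on gneisenau ()",
    "|22:02:32|BTC| 7|000|DDIC | |R49|Communication error, CPIC return code 020, SAP return code 456",
    "|22:02:32|BTC| 7|000|DDIC | |R5A|> Conversation ID: 38910614",
]

records = [r for r in map(parse_line, raw) if r is not None]
by_msg_id = Counter(r["msg_id"] for r in records)
print(by_msg_id)
```

Once the records are dictionaries, counting by message ID, user, or time bucket is a one-liner, which is the starting point for the rarity and deviation techniques discussed later.
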
  17. 17. What Do We “Mine” for? <ul><li>How about for something interesting ? </li></ul><ul><li>One research paper defines “interesting” thus: </li></ul><ul><ul><li>Unexpected to user (aka not “normal”, not routine) </li></ul></ul><ul><ul><li>Actionable (we can and/or should do something about it) </li></ul></ul><ul><li>Examples : </li></ul><ul><ul><li>Compromised/infected system </li></ul></ul><ul><ul><li>Successful attack </li></ul></ul><ul><ul><li>Insider abuse and data theft </li></ul></ul><ul><ul><li>Other data leaks, intentional and not </li></ul></ul><ul><ul><li>Covert channel/hidden backdoor communication </li></ul></ul><ul><ul><li>Increase in probing </li></ul></ul><ul><ul><li>Mysterious system crash </li></ul></ul>
  18. 18. Simple Example <ul><li>Too many attack types from a single IP address </li></ul><ul><li>Right next to known vulnerability scanners </li></ul><ul><li>External IP address </li></ul><ul><li>Conclusion : potentially dangerous attacker </li></ul>
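
The "too many attack types from a single IP address" check can be sketched as a count of distinct attack types per source. The threshold `min_types`, the event tuples, and the addresses (taken from the reserved documentation ranges) are all assumptions made for this example, not data from the talk.

```python
from collections import defaultdict

def noisy_sources(events, min_types=3):
    """Map each source IP to its distinct attack types, keeping only
    sources that triggered at least `min_types` different types."""
    types_by_src = defaultdict(set)
    for src, attack_type in events:
        types_by_src[src].add(attack_type)
    return {
        src: sorted(types)
        for src, types in types_by_src.items()
        if len(types) >= min_types
    }

# Synthetic (source IP, attack type) pairs, for illustration only.
events = [
    ("203.0.113.7", "port scan"),
    ("203.0.113.7", "sql injection"),
    ("203.0.113.7", "directory traversal"),
    ("203.0.113.7", "port scan"),
    ("198.51.100.4", "port scan"),
]

suspects = noisy_sources(events)
print(suspects)
```

A source that repeats one attack type many times is not flagged here; only variety counts, which is what separates a scanner-like attacker from a single noisy signature.
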
  19. 19. Deeper into interesting - I <ul><li>Approaches to finding interesting stuff in logs without knowing specifically what we are looking for: </li></ul><ul><li>Rare things </li></ul><ul><ul><li>Is compromise rare in your environment? </li></ul></ul><ul><li>Different things </li></ul><ul><ul><li>Is today “just another day” … or not? </li></ul></ul><ul><li>“Out of character” things </li></ul><ul><ul><li>It always does it… but not today? </li></ul></ul><ul><li>Weird-looking things </li></ul>
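
One simple way to look for "rare things" is to count event types and report those that make up only a tiny share of the total. The sketch below assumes events have already been reduced to a type label; the 1% cutoff (`max_share`) and the sample counts are arbitrary choices for illustration.

```python
from collections import Counter

def rare_events(event_types, max_share=0.01):
    """Return event types whose share of all events is at most `max_share`."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return sorted(e for e, n in counts.items() if n / total <= max_share)

# Synthetic day of events: two routine types and two rare ones.
day = (
    ["login"] * 500
    + ["logout"] * 495
    + ["config change"] * 4
    + ["account created"] * 1
)

rare = rare_events(day)
print(rare)
```

Rare does not mean malicious, so the output is a list of candidates for a human to review, not a list of alerts.
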
  20. 20. Example 1: Can You Guess What Happened?! Destination Port 1D Baseline
  21. 21. Example 2: Can You Guess What Happened?!
  22. 22. Example 3: Can You Guess What Happened?!
  23. 23. Deeper into interesting - II <ul><li>Things going in an unusual direction </li></ul><ul><ul><li>Your web server is now a web client - to “hack.kz” </li></ul></ul><ul><li>Top things and Bottom Things </li></ul><ul><ul><li>And them changing places! </li></ul></ul><ul><li>Strange combinations of uninteresting things </li></ul><ul><ul><li>A nice connection to a web server </li></ul></ul><ul><ul><li>A nice configuration change </li></ul></ul><ul><ul><li>A nice user creation </li></ul></ul><ul><ul><li>SO, is it NICE? </li></ul></ul>
  24. 24. Example 4: Can You Guess What Happened?!
  25. 25. Example 5: Can You Guess What Happened?!
  26. 26. Deeper into interesting - III <ul><li>Counts of an otherwise uninteresting thing </li></ul><ul><ul><li>Pings? Connections to port 80? Error 404s? </li></ul></ul><ul><li>Ratios of otherwise uninteresting things </li></ul><ul><ul><li>Login failures / login successes? </li></ul></ul><ul><ul><li>Inbound / outbound connections? </li></ul></ul><ul><li>Frequencies of things </li></ul><ul><ul><li>Frequent becoming rare – and vice versa! </li></ul></ul><ul><li>Time series behaving badly </li></ul><ul><ul><li>Traffic overall grows, but traffic vs system X slows </li></ul></ul>
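
The "login failures / login successes" ratio can be computed per user with two counters. This is a sketch under assumptions: events are taken to be `(user, outcome)` pairs with outcome `"failure"` or `"success"`, and the user names, counts, and the 1.0 threshold are invented for the example.

```python
from collections import Counter

def failure_ratios(events):
    """Return {user: failures / successes} for every user seen."""
    fails, oks = Counter(), Counter()
    for user, outcome in events:
        if outcome == "failure":
            fails[user] += 1
        else:
            oks[user] += 1
    # max(..., 1) avoids division by zero for users with no successes.
    return {u: fails[u] / max(oks[u], 1) for u in set(fails) | set(oks)}

# Synthetic login events, for illustration only.
events = (
    [("alice", "failure")] + [("alice", "success")] * 4
    + [("bob", "failure")] * 6 + [("bob", "success")]
    + [("carol", "failure")] * 2
)

ratios = failure_ratios(events)
suspicious = sorted(u for u, r in ratios.items() if r >= 1.0)
print(ratios, suspicious)
```

Neither a failed login nor a successful one is interesting by itself; the ratio is what stands out, and tracking it over time per user gives the "out of character" signal described earlier.
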
  27. 27. Example 6: Can You Guess What Happened?!
  28. 28. More Examples <ul><li>Structure of examples: </li></ul><ul><ul><li>What was discovered? </li></ul></ul><ul><ul><li>What really happened? </li></ul></ul><ul><ul><li>How did we discover “the truth”? </li></ul></ul><ul><li>All examples are from the tools prototyped and tested by the author … </li></ul><ul><ul><li>Deviations and snapshot comparisons for firewall traffic </li></ul></ul><ul><ul><li>Scan detection from firewall data </li></ul></ul><ul><ul><li>Event rarity across system logs </li></ul></ul><ul><ul><li>“Rich” event sequences in mixed logs </li></ul></ul><ul><ul><li>Ratio analysis for logins and status codes </li></ul></ul><ul><ul><li>Pattern recognition and rule mining </li></ul></ul><ul><ul><li>Local to global trend comparisons in logs </li></ul></ul>
  29. 29. Simple Example Revisited
  30. 30. Example 7: Can You Guess What Happened?!
  31. 31. Example 8: Fun Port Metrics
  32. 32. Example 9: Can You Guess What Happened?!
  33. 33. Real-life Usage <ul><li>A busy analyst comes in the morning…gets coffee </li></ul><ul><li>Remembers that he needs to monitor security in addition to 1,576,903 other tasks  </li></ul><ul><li>Looks at a combination report showing “What is Interesting Today?” </li></ul><ul><li>Investigates some of the items, takes action, etc </li></ul><ul><li>Tells the system not to bother him with the rest in the future </li></ul><ul><li>Goes for more coffee and drowns in the sea of other tasks  </li></ul>
  34. 34. How YOU can do it - I? <ul><li>First , collect logs and events </li></ul><ul><ul><li>Syslog-NG to some SQL </li></ul></ul><ul><ul><li>AANVAL, OSSIM </li></ul></ul><ul><ul><li>ACID/BASE </li></ul></ul><ul><ul><li>Syslog2SQL </li></ul></ul><ul><ul><li>Custom log-to-SQL system (not that hard) </li></ul></ul><ul><ul><li>Whatever SQL log and event store (commercial, open-source, home-grown) </li></ul></ul>
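
A "custom log-to-SQL system" can start very small. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `events` and its columns are assumptions loosely modeled on the common log fields listed earlier in the deck, and the rows are synthetic; a real deployment would load parsed records from the collected logs into whatever SQL store is in use.

```python
import sqlite3

# In-memory database for the example; use a file path or a server-backed
# SQL store for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           ts TEXT, src TEXT, dst TEXT, proto TEXT,
           dport INTEGER, user TEXT, event_type TEXT, bytes INTEGER
       )"""
)

# Synthetic normalized events, for illustration only.
rows = [
    ("2007-03-09 10:00:01", "192.0.2.10", "192.0.2.80", "tcp", 80, None, "accept", 5120),
    ("2007-03-09 10:00:05", "192.0.2.11", "192.0.2.80", "tcp", 80, None, "accept", 2048),
    ("2007-03-09 10:01:12", "192.0.2.10", "192.0.2.22", "tcp", 22, "alice", "login", 0),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# A first summary query: which destination ports are hit most often?
top_ports = conn.execute(
    "SELECT dport, COUNT(*) AS n FROM events GROUP BY dport ORDER BY n DESC, dport"
).fetchall()
print(top_ports)
```

Once events from different sources share one table, the counts, ratios, and comparisons described in the earlier slides become plain SQL queries or short scripts on top of them.
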
  35. 35. How YOU can do it - II? <ul><li>Second , plan what to baseline </li></ul><ul><ul><li>Network </li></ul></ul><ul><ul><ul><li>Port access, system access, protocols, event types </li></ul></ul></ul><ul><ul><li>System </li></ul></ul><ul><ul><ul><li>Login/logout success/failure, process starts, configuration changes </li></ul></ul></ul><ul><ul><li>Application or database </li></ul></ul><ul><ul><ul><li>Data access type, user, data changes, client, etc </li></ul></ul></ul>
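
Once a baseline exists, comparing it with a fresh snapshot is straightforward. The sketch below compares per-port connection counts for "today" against a baseline and labels ports that are new, gone, or changed by more than a chosen factor. The factor of 3, the port numbers, and the counts are all illustrative assumptions; the same function works for any keyed counts (event types, users, systems).

```python
def deviations(baseline, today, factor=3.0):
    """Compare two {key: count} snapshots and label notable differences."""
    report = {}
    for key in set(baseline) | set(today):
        b, t = baseline.get(key, 0), today.get(key, 0)
        if b == 0 and t > 0:
            report[key] = "new"    # never seen in the baseline
        elif t == 0 and b > 0:
            report[key] = "gone"   # present in the baseline, absent today
        elif t >= b * factor:
            report[key] = "up"     # grew by at least `factor`
        elif t * factor <= b:
            report[key] = "down"   # shrank by at least `factor`
    return report

# Synthetic counts of connections per destination port.
baseline = {80: 1000, 443: 800, 25: 120, 21: 30}
today = {80: 1100, 443: 790, 25: 500, 6667: 40}

report = deviations(baseline, today)
print(report)
```

Ports with ordinary day-to-day variation (80 and 443 here) produce no entry at all, so the report stays short enough for a busy analyst to read.
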
  36. 36. How YOU can do it - III? <ul><li>Third , script the analysis techniques you liked </li></ul><ul><ul><li>Perl with SQL access modules </li></ul></ul><ul><ul><li>Python – a new fave of those who know  ! </li></ul></ul><ul><ul><li>PHP </li></ul></ul>
  37. 37. How YOU can do it - IV? <ul><li>Fourth , act on the results </li></ul><ul><ul><li>Mitigate, block, disable, fire  , slice-n-dice  </li></ul></ul><ul><li>Fifth , automate as needed </li></ul><ul><ul><li>More data, more tools, more results… </li></ul></ul>
  38. 38. Conclusion <ul><li>LM and KDL… </li></ul><ul><li>… are a cool new way of looking at log data </li></ul><ul><li>… actually work </li></ul><ul><li>… can help where common analysis methods fail </li></ul><ul><li>… are not that hard </li></ul><ul><li>… can be applied to different kinds of data: database logs, application logs, etc. </li></ul>
  39. 39. Take These Home with You!! <ul><li>Look at your logs! You’ll be happy you started now and not tomorrow (*) </li></ul><ul><li>Simple analysis is incredibly useful, but it only goes so far </li></ul><ul><li>“Complicated” analysis really isn’t that complicated and can be done “on the cheap” </li></ul>
  40. 40. <ul><li>Thank You for Coming! </li></ul>
  41. 41. Feedback? Q&A? <ul><li>Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA </li></ul><ul><li>Chief Logging Evangelist </li></ul><ul><li>LogLogic, Inc </li></ul><ul><li>[email_address] </li></ul><ul><li>http://www.chuvakin.org </li></ul><ul><li>See www.info-secure.org for my papers, books, reviews </li></ul><ul><li>and other security resources; </li></ul><ul><li>www.securitywarrior.org for “Security Warrior” book (2004) </li></ul>