Data Mining with Splunk


  1. 1. Data Mining and Exploration David Carasso, Office of CTO, Chief Mind
  2. 2. AGENDA What is data mining? What’s the plan of attack? What type of events do I have? How do I mine fields? How do I detect anomalous events? Why do I need to visualize my data?
  3. 3. What is Data Mining? 3
  4. 4. Is this data mining? This is an orange 4
  5. 5. What is Data Mining? Extracting implicit, previously unknown, and potentially useful information from data. 5
  6. 6. Better 6
  7. 7. Data Preparation Understanding Data Exploration Data Mining 7
  8. 8. What’s the plan of attack? 8
  9. 9. Preparing the data You've been thrown data you aren't familiar with…
      Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
      Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
      Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
      Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user 'root'
      Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
      Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
      Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
      Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
      ....
      What to find: Eventtypes (closed sessions), Fields (pid), Transactions (open-close), Anomalies (unexpected address)
  10. 10. Is Understanding Linear? Event Groups Events reports Anomalies Fields No. 10
  11. 11. What type of events do I have? 11
  12. 12. Given Some Unknown Data
      Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
      Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
      Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
      Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user 'root'
      Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
      Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
      Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration ...
      Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
      Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
      ....
  13. 13. Find Broad Categories of Events Group Events by Content, Format, and Time 13
  14. 14. Group Events by Content Cluster events with similar values. Show 3 examples from each cluster, from the most common cluster to the least: …| cluster labelonly=t showcount=t | dedup 3 cluster_label sortby -cluster_count, cluster_label, _time 14
  15. 15. Events By Content
      count  label  _raw
      -----  -----  ------------------------------------------------------------
      1339   3      Mar 7 11:05:01 willLaptop crond(pam_unix)[6785]: session opened for user root by…
      1339   3      Mar 7 11:10:01 willLaptop crond(pam_unix)[1769]: session opened for user root by …
      1339   3      Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session opened for user root by …
      1324   2      Mar 7 11:05:02 willLaptop crond(pam_unix)[6785]: session closed for user root
      1324   2      Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session closed for user root
      1324   2      Mar 7 11:10:02 willLaptop crond(pam_unix)[1769]: session closed for user root
      136    13     Mar 7 20:05:08 willLaptop kernel: SELinux: initialized (dev selinuxfs, type selinuxfs)…
      136    13     Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev usbfs, type usbfs), uses …
      136    13     Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev sysfs, type sysfs), uses …
  16. 16. Group by $%#! Format Cluster events by first 7 punctuation chars: …| rex field=punct "(?<smallpunct>.{7})" | eventstats count by smallpunct | sort -count, smallpunct | dedup 3 smallpunct 16
  17. 17. Events by Format
      count  smallpunct  raw
      -----  ----------  ------------------------------------------------------------
      637    __::__(     Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
      637    __::__(     Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
      637    __::__(     Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by …
      367    __::__:     Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
      367    __::__:     Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
      367    __::__:     Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
      57     __::__[     Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2
      57     __::__[     Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum 10
      57     __::__[     Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s
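
To make the grouping-by-format idea concrete outside Splunk, here is a minimal Python sketch. It only approximates a punctuation signature by dropping letters and digits and turning whitespace into underscores; that rule is an assumption inferred from the sample output, not Splunk's actual `punct` implementation, and `punct_signature` is a hypothetical helper name.

```python
import re
from collections import Counter


def punct_signature(line: str, length: int = 7) -> str:
    """Approximate punctuation signature (assumed rule, not Splunk's code):
    drop letters and digits, replace whitespace with '_', keep the first chars."""
    stripped = re.sub(r"[A-Za-z0-9]", "", line)
    return re.sub(r"\s", "_", stripped)[:length]


# A few sample events copied from the slides.
events = [
    "Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root",
    "Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root",
    "Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50",
    "Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s",
]

# Count events per signature, most common first (like eventstats + sort).
counts = Counter(punct_signature(e) for e in events)
for signature, count in counts.most_common():
    print(count, signature)
```

Events from the same program tend to share a signature, so even a crude signature like this separates cron, dhclient, and ntpd lines without parsing them.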
  18. 18. Group by Time Look for bursts of events • Turn on computer • Load a web page • Detects speeding car • Print document • Scan security badge 18
  19. 19. Group by Time Bursts … | transaction maxpause=2s | search eventcount>1
      Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session opened for user root by (uid=0)
      Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by (uid=0)
      Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
      Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
      Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
      Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
      Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
      Mar 10 16:45:01 willLaptop crond(pam_unix)[9553]: session opened for user root by (uid=0)
      Mar 10 16:45:02 willLaptop crond(pam_unix)[9553]: session closed for user root
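
The burst grouping can be sketched in plain Python as follows. This is an illustration of the idea rather than how `transaction` works internally: `group_bursts` is a hypothetical helper, the year 2012 is an assumption (syslog lines carry no year), and a gap of exactly 2 seconds is assumed to stay within the same burst.

```python
from datetime import datetime, timedelta


def group_bursts(events, max_pause=timedelta(seconds=2)):
    """Group (timestamp, text) pairs into bursts; a new burst starts whenever
    the gap since the previous event exceeds max_pause."""
    bursts = []
    for ts, text in sorted(events):
        if bursts and ts - bursts[-1][-1][0] <= max_pause:
            bursts[-1].append((ts, text))
        else:
            bursts.append([(ts, text)])
    return bursts


def at(hour, minute, second):
    # Assumed date: the sample logs give only "Mar 10" with no year.
    return datetime(2012, 3, 10, hour, minute, second)


events = [
    (at(16, 45, 1), "crond[9553]: session opened for user root"),
    (at(16, 45, 2), "crond[9553]: session closed for user root"),
    (at(16, 46, 27), "ntpd[2544]: synchronized to LOCAL(0)"),
    (at(16, 46, 32), "ntpd[2544]: synchronized to 138.23.180.126"),
    (at(16, 50, 1), "crond[9638]: session opened for user root"),
    (at(16, 50, 1), "crond[9638]: session closed for user root"),
    (at(16, 50, 2), "crond[9639]: session closed for user root"),
]

bursts = group_bursts(events)
multi_event = [b for b in bursts if len(b) > 1]   # like: search eventcount>1
single_event = [b for b in bursts if len(b) == 1]  # like: search eventcount=1
print(len(multi_event), "multi-event bursts;", len(single_event), "bursts of one")
```

The same function serves both uses in the deck: large bursts give a broad picture of related events, while bursts of one are candidates for anomalies.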
  20. 20. Multiple Sources (not really correct) 20
  21. 21. Now what? 1. ✓ group your data 2. tell Splunk! 21
  22. 22. Telling Splunk (about your groups of events) Add eventtypes and tags Huh? 22
  23. 23. SURPRISE TANGENT! What is an eventtype? 23
  24. 24. Eventtype A dynamic “tag” added to events, if they would match the search that defines the eventtype. 24
  25. 25. Eventtype: Name: “closed_root” Definition: “session closed” root Event: … session closed for user root … => eventtype=closed_root 25
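
As a rough illustration of the concept (not Splunk's implementation), an eventtype can be modeled as a named search whose name is attached to every event the search would match. In this sketch `EVENTTYPES` and `eventtypes_for` are hypothetical names, and matching is simplified to case-insensitive substring tests, which is an assumption.

```python
# Each eventtype is a name plus the terms its defining search requires.
EVENTTYPES = {
    "closed_root": ["session closed", "root"],
}


def eventtypes_for(event: str) -> list:
    """Return the names of all eventtypes whose terms all occur in the event."""
    lowered = event.lower()
    return [
        name
        for name, terms in EVENTTYPES.items()
        if all(term.lower() in lowered for term in terms)
    ]


closed = "Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root"
opened = "Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)"

print(eventtypes_for(closed))  # the closed-session event picks up the eventtype
print(eventtypes_for(opened))  # the opened-session event matches nothing
```

Because the tag is computed at search time from the definition, changing the definition changes which events carry the eventtype without touching the stored data.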
  26. 26. Create an Eventtype 26
  27. 27. Independent searches will return events tagged with previous eventtypes that help classify events. 27
  28. 28. Create reports on the classifications you’ve made Ok, it wasn’t a tangent. 28
  29. 29. How do I mine fields? 29
  30. 30. Fields Correlation Discover correlations to remove uninteresting fields and narrow in on promising reports. haiku 30
  31. 31. Fields Correlation Haiku Discover patterns in fields with a correlation: co-occurring fields. indulgence 31
  32. 32. Splunkd.log Sample File 09-05-2012 15:34:11.886 -0700 INFO ExecProcessor - Ran script: python /opt/splunk/etc/apps/... 09-05-2012 15:34:02.467 -0700 ERROR TcpOutputProc - Can't find or illegal IP address or ... 09-05-2012 15:32:03.397 -0700 INFO ProcessTracker - Process ran long; type=SplunkOptimize ... 09-05-2012 15:30:20.016 -0700 WARN DispatchCommand - The system is approaching the maximum ... fascinating 32
  33. 33. Field Correlation … | correlate
      RowField   C     CN    Component  Context  L     ...
      ---------  ----  ----  ---------  -------  ----
      C          1.00  1.00  0.00       0.00     1.00
      CN         1.00  1.00  0.00       0.00     1.00
      Component  0.00  0.00  1.00       0.06     0.00
      Context    0.00  0.00  0.06       1.00     0.00
      L          1.00  1.00  0.00       0.00     1.00
      Log_Level  0.00  0.00  1.00       0.06     0.00
      …
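
One way to picture a field co-occurrence score is the sketch below. The formula used here (events containing both fields divided by events containing either) is an assumption chosen so that 1.0 means "always together" and 0.0 means "never together"; the slides do not state the exact formula `correlate` uses, and the sample events are made up.

```python
def cooccurrence(events, field_a, field_b):
    """Share of events containing either field that contain both (assumed metric)."""
    both = sum(1 for e in events if field_a in e and field_b in e)
    either = sum(1 for e in events if field_a in e or field_b in e)
    return both / either if either else 0.0


# Hypothetical events, each represented as a dict of extracted fields.
events = [
    {"Component": "loader", "Log_Level": "INFO"},
    {"Component": "TcpOutputProc", "Log_Level": "ERROR", "Context": "source"},
    {"C": "US", "CN": "example", "L": "San Francisco"},
]

fields = ["C", "CN", "Component", "Context", "L", "Log_Level"]
for row in fields:
    print(row.ljust(10), " ".join(f"{cooccurrence(events, row, col):.2f}" for col in fields))
```

Fields that always score 1.00 against each other carry redundant presence information, which is one reason to drop some of them before building reports.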
  34. 34. Field Associations automatically deduce correlations and implications of field values: …| associate Log_Level Component 34
  35. 35. Field Association Summary
      Ref_Key    Ref_Value                 Target_Key  Support  Uncond Entropy  Cond Entropy  Increase  Top_Conditional_Value
      ---------  ------------------------  ----------  -------  --------------  ------------  --------  ------------------------
      Component  DatabaseDirectoryManager  Log_Level   34.67%   1.182           0.000         1.182201  WARN (62.25% -> 100.00%)
      Component  HotDBManager              Log_Level   38.25%   1.182           0.000         1.182201  INFO (33.15% -> 100.00%)
      Component  SavedSplunker             Log_Level   394.31%  1.182           0.000         1.182201  WARN (62.25% -> 100.00%)
      Component  databasePartitionPolicy   Log_Level   95.50%   1.182           0.417         0.765017  INFO (33.15% -> 91.57%)
      Component  loader                    Log_Level   79.17%   1.182           0.050         1.131883  INFO (33.15% -> 99.44%)
      Component  timeinvertedIndex         Log_Level   44.28%   1.182           0.000         1.182201  INFO (33.15% -> 100.00%)
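
The entropy columns can be understood with a small sketch: compute the entropy of the target field over all events, then again over only the events where the reference field has a given value. The drop between the two is how much knowing the reference value helps. This uses Shannon entropy in bits, which is an assumption (the slides do not say which logarithm base `associate` uses), and the component names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def conditional_entropy(pairs, ref_value):
    """Entropy of the target among events whose reference field equals ref_value."""
    return entropy([target for ref, target in pairs if ref == ref_value])


# Hypothetical (Component, Log_Level) pairs.
pairs = [
    ("compA", "WARN"),
    ("compA", "WARN"),
    ("compB", "INFO"),
    ("compB", "WARN"),
]

unconditional = entropy([level for _, level in pairs])
for component in ("compA", "compB"):
    conditional = conditional_entropy(pairs, component)
    print(component, round(unconditional, 3), round(conditional, 3),
          round(unconditional - conditional, 3))
```

In this made-up data, knowing the component is compA removes all uncertainty about the log level, while knowing it is compB makes the level harder to guess than before.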
  36. 36. Top Fields by Fields Most common Log_Level by Component: ... | top Log_Level by Component
      Component                 Log_Level  count  percent
      ------------------------  ---------  -----  ----------
      AdminManager              WARN       1      100.000000
      DatabaseDirectoryManager  WARN       153    100.000000
      DateParserVerbose         WARN       262    100.000000
      DedupProcessor            ERROR      1      100.000000
      DeploymentClient          DEBUG      60     85.714286
      DeploymentClient          WARN       5      7.142857
  37. 37. How do I detect anomalous events? 37
  38. 38. Types of Anomalies Anomalies you know about Anomalies you don’t know about 38
  39. 39. Handling Known Anomalies. Easy. Define a search for the anomalous condition and make an alert to detect it. Examples: ip=10.* NOT domain=mycompany.com; or, if … | stats perc99(spent) gives 500ms, alert on “spent>500”. 39
  40. 40. Finding Unknown Anomalies Look for Abnormal • Single-Field Values • Multi-Field Values • Contexts • Visual Inspections… 40
  41. 41. Anomalies by Single Field Values Identify anomalous values in a given field either by frequency of occurrence or number of standard deviations from the mean. … | anomalousvalue action=summary pthresh=0.02 | search isNum=YES 41
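
The "number of standard deviations from the mean" criterion can be sketched as below. This is a simplified stand-in, not the logic of `anomalousvalue`: the function name, the threshold of 2 standard deviations, and the sample values are all assumptions for illustration, and the frequency-based criterion is not covered.

```python
from statistics import mean, pstdev


def anomalous_values(values, max_sigma=2.0):
    """Return values lying more than max_sigma standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > max_sigma]


# Hypothetical numeric field values with one obvious outlier.
response_times = [10, 11, 9, 10, 12, 10, 11, 95]
print(anomalous_values(response_times))
```

A single extreme value inflates both the mean and the standard deviation, so this test can miss outliers in small samples; that is one reason to combine it with the other approaches in this section.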
  42. 42. Anomalies by Single Field Values 42
  43. 43. Anomalous by Many Values Look for small clusters – by content, format, and time – to find anomalies. For example… …| cluster …| sort cluster_count 43
  44. 44. Smallest Clusters by Content
      count  label  uri
      1      7      /img/skins/default/bolt.png
      1      37     /en-US/search/inspector?sid=1345075042.125&namespace=search
      1      45     /services/admin/summarization?count=10
      1      53     /services/pdfgen/is_available?viewId=index_status_health&...
      1      57     /static/splunkrc_cmds.xml
  45. 45. Small Clusters: Bursts of One Find bursts of just a single event where a pause of 2 seconds occurred around it. … | transaction maxpause=2s | search eventcount = 1
      Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126…
      Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum…
      Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567…
  46. 46. Burst of One Same idea, different data source: splunk [11:58:08] "POST /services/search/jobs/export HTTP/1.1" 200 201630 … [11:12:51] "POST /services/search/jobs/export HTTP/1.1" 200 459441 … [10:00:58] "GET /servicesNS/nobody/SplunkDeploymentMonitor/backfill/… 46
  47. 47. Anomalous by Context Identify values not expected by the context of other events. … | anomalies field=file labelonly=true maxvalues=10 47
  48. 48. Anomalous by Context Unexpectedness file 0.00 shelper 0.16 shelper 0.00 1345502591.356 0.00 1345502591.356 0.00 1345074401.191 0.00 1345074031.153 time 0.03 1345074328.186 0.00 1345502591.356 0.35 conf-dm_backfill 0.00 1345074309.185 0.00 1345502591.356 48
  49. 49. Surprise Eventtype: Part Deux! Classified major categories of your data with eventtypes? -- just search for things that don’t match those eventtypes 49
  50. 50. 50
  51. 51. Once you can describe anomalous behavior as a search… 51
  52. 52. 52
  53. 53. Other mining commands • kmeans: Performs k-means clustering on selected fields. • outlier: Removes outlying numerical values. • af (analyze fields): Analyzes numerical fields for their ability to predict another discrete field. • fieldsummary: Generates summary information for fields. • shape: Produces a symbolic 'shape' attribute describing the shape of a numeric multivalued field. 53
  54. 54. Why do I need to visualize my data? 54
  55. 55. Data Mining by Visualization Visualization can capture nuances in the data that numerical or linguistic summaries cannot easily capture. 55
  56. 56. These data points are radically different. *Source: Anscombe’s Quartet (Anscombe 1973) 56
  57. 57. Why visualize? Because they all have the exact same • average (7.50) • standard deviation (2.03) • least-squares fit (3 + 0.5x). Do not just rely on numerical summarization. 57
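
The claim is easy to check numerically. The sketch below hard-codes the four published Anscombe (1973) datasets and computes the mean and sample standard deviation of y plus a least-squares line for each; the helper names `least_squares` and `summarize` are my own, and rounding to the slide's precision is a choice made for comparison.

```python
from statistics import mean, stdev

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def summarize(xs, ys):
    """Mean of y, sample standard deviation of y, intercept, slope (rounded)."""
    intercept, slope = least_squares(xs, ys)
    return round(mean(ys), 2), round(stdev(ys), 2), round(intercept, 1), round(slope, 1)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in summaries.items():
    print(name, summary)
```

All four datasets produce the same rounded summary, yet plotting them shows a noisy line, a curve, a line with one outlier, and a vertical cluster with one far point.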
  58. 58. But I already have charts! You don’t graph enough. Data Exploration Don’t decide ahead of time what graphs you want Regularly do out-of-the-box scenarios with graphs 58
  59. 59. Data Exploration Variations: • Subsets of Events (paying customers vs lookers) • Fields by Fields (including eventtypes and tags) • Ignored fields • Min/max/avg/count • Compare to other time windows • Transactions 59
  60. 60. Visual Arrangement Sorting data, changing scales (linear/log), and setting min/max can make a huge difference when looking at the same data. 60
  61. 61. Visual Considerations Pick representations that make obvious the distinctions you need to care about. 61
  62. 62. Summary 62
  63. 63. Summary • Discovery is an iterative process. • Group events by content, format, and time, and define classifications with eventtypes and tags • Focus on promising fields with correlations • Discover unknown anomalies with small clusters. • Visualize your data, from a dozen angles. 63
  64. 64. But wait! 64
  65. 65. More to come: Predictive Analytics … | forecast foo 65
  66. 66. The End Mine the Gap. [ASCII-art banner] Golf clapping at #datamining 66

Editor's Notes

  • ----- Meeting Notes (9/7/12 14:21) -----[ASK AUDIENCE -- WHAT IS DATA MINING?]
  • No. Explicit. Learning nothing new. Not significant in meaning. I’m explicitly telling you what it is. You’re not mining it. By looking at the data, you’re not learning anything new by me saying this is an orange. And frankly it’s not useful.
  • Regularities, patterns, anomalies that are interesting, meaning not obvious, explicit inferences, and at the same time not coincidental or noisy inferences.
  • Yellow is Soda. Blue is Pop. Red is Coke.
  • Before we can really mine a bunch of text for valuable information, we need to do some prep work. We need to understand our data – the dimensions, the sets of values. In Splunk terms – create fields, eventtypes, transactions, etc. By adding fields, you’re mining out dimensions; by adding eventtypes, you’re mining classes; by adding transactions, you’re mining correlations; etc. BUT… Prepping the data for mining is a data mining task of sorts in itself, and the line between understanding your data and mining is really non-existent. This before-work is sometimes called Data Exploration.
  • The more knowledge you can add to Splunk about your data, the more options you’ll have to analyze it. There may be data cleaning involved.
  • You can go from groups of events to understanding events to understanding fields to understanding normality/anomalies to generating reports. But the truth is, this is an iterative process. Each step tells you more about something else. (Un)fortunately, this presentation is linear.
  • Raw values, like raw text.
  • Make eventtypes for “session opened”, “session closed”, “linux initialized”. Tag them. Then mine out questions like “how long is the average session?”, “how much churn is there?”, etc.
  • Consider linecount as well.
  • Make eventtypes or tags for cron jobs, ntpd, dhclient. Then mine out questions like “who is running what jobs?” and “which are the most common?”
  • One of the most useful ways to see how your individual events relate to each other is to look for pauses in your events, as real physical events often happen in bursts. For example, there are bursts of log activity: when you shut down a computer; when you access a web page, which has many images; when a car factory robot detects the next car; when you turn on a printer and it connects to your computer; when you scan your security badge.
  • Make transactions for sessions opening and closing. Find unclosed transactions. How often, how many, by whom?
  • No reason to limit correlations to a particular data source. Splunk can easily correlate them together in one search. The search isn’t correct in that the dedup is removing important consecutive events, but it was useful for showing small correlated events across sources.
  • If Facebook had eventtypes, you’d define any picture that has any of your family members but no co-workers as a ‘family’ pix that you could then have a virtual photo album for. Any pix with a family member outside the Bay Area would be a “family vacation” pix. When you search your data, you’re essentially weeding out all unwanted events; the results of your search are events that share common characteristics, and you can give them a collective name or “event type”. The names of your event types are added as values into an eventtype field. This means that you can search for, and report on, these groups of events the same way you search for any field. The following example takes you through the steps to save a search as an eventtype and then search for that field. If you run frequent searches to investigate SSH and firewall activities, such as sshd logins or firewall denies, you can save these searches as event types. Also, if you see error messages that are cryptic, you can save them as an event type with a more descriptive name.
  • Why? To reduce the number of fields you focus on to those with the most value for analysis and graphing.
  • A 1.0 means two fields always co-occur. For example, Component and Log_Level always co-occur in splunkd.log. You can filter out fields to make this table more manageable.
  • ----- Meeting Notes (9/4/12 11:49) -----give splunkd example output first to show log
  • This shows that before we know the component is SavedSplunker, the odds of a WARN Log_Level are 62.25%; afterwards, the odds are 100%. Before we know the component is loader, the odds of an INFO Log_Level are 33.15%; afterwards, 99.44%.
  • What are anomalies/outliers? The set of data points that are considerably different. Applications: network intrusion detection, fault detection, credit card fraud detection, telecommunication fraud detection. Build a profile of the “normal” behavior (patterns, stats) to detect anomalies. Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look.
  • Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look. … | eventstats perc99(spent) as bigspender | where spent > bigspender
  • Very often you want to find anomalies/problems in your IT data, but you don’t know what to look for. Single Value: a ‘port’ value is highly irregular. Many Values: many values look different than others. Anomalous: many values were unexpected by context. Everything applies to transactions as well. Look for anomalies.
  • Identifies values in the data that are anomalous either by frequency of occurrence or number of standard deviations from the mean. Make searches to find these anomalous values and create alerts.
  • catNormFreq = the average frequency of non-anomalous values. isNum means all values of the field were numerical. Basically we assume a normal distribution, but if we find that ends up causing too many values to be anomalous, we don’t use it.
  • Earlier we looked for large clusters to get a broad understanding of the events. We grouped by content, format, and time. Now, just flip it. Make searches to find these anomalous values and create alerts.
  • Same for format (looking for unusual punctuation) or especially long pauses between events (10 seconds?). Make searches to find these anomalous values and create alerts.
  • These slow events are often important and indicate longer tasks.
  • Make eventtypes or tags for these slow, important events. Who runs them most? Are they a problem? Why is someone exporting, or backfilling their data? Make an alert when it happens.
  • Experimental search command that uses compression and a window of the last N events to see if a new event compresses well with past events, or if it looks unexpected. Make searches to find these anomalous values and create alerts.
  • Make searches to find these anomalous values and create alerts.
  • One of the most obvious and important methods of discovering what your data is saying is to simply graph your data. Humans have a well-developed ability to analyze large amounts of data presented visually, detecting general patterns and trends, as well as outliers and unusual patterns.
  • What data points are outliers? What inferences would you make? Radically different.
  • Limitations of statistical approaches: they usually test a single attribute; distributions aren’t known; for many dimensions, it is hard to estimate the true distribution. Do not just rely on numerical summarization, or you won’t see what’s going on.
  • Same for transactions of events, and classes of events (eventtypes) and field-values (tags)
  • Eventually you’ll tweak out little nuggets of knowledge. Over time, what is the average duration users spend on my website by language of country, compared to last month? How does the time on the website correlate with the time of day, or browser? Does the max delay for each server vary over time by language? Same for transactions of events, and classes of events (eventtypes) and field-values (tags).
  • So reducing the number of dimensions down to 2 or 3 for visualization and limiting the data shown
  • Heat map vs much more useful chart
  • Discovery: Each step tells you more about everything else.
  • predicting foo and getting better and better at it, and towards the right edge you can see it’s predicting values that haven’t happened yet
