Data Mining with Splunk
 


 

Speaker Notes

  • ----- Meeting Notes (9/7/12 14:21) ----- [ASK AUDIENCE -- WHAT IS DATA MINING?]
  • No. Explicit. Learning nothing new. Not significant in meaning. I’m explicitly telling you what it is. You’re not mining it. By looking at the data, you’re not learning anything new by me saying this is an orange. And frankly, it’s not useful.
  • Regularities, patterns, and anomalies that are interesting, meaning not obvious, explicit inferences, and at the same time not coincidental or noisy inferences.
  • Yellow is Soda. Blue is Pop. Red is Coke.
  • Before we can really mine a bunch of text for valuable information, we need to do some prep work. We need to understand our data: the dimensions, the sets of values. In Splunk terms, create fields, eventtypes, transactions, etc. By adding fields, you’re mining out dimensions; by adding eventtypes, you’re mining classes; by adding transactions, you’re mining correlations; etc. BUT… prepping the data for mining is a data mining task of sorts in itself, and the line between understanding your data and mining is really non-existent. This before-work is sometimes called Data Exploration.
  • The more knowledge you can add to Splunk about your data, the more options you’ll have to analyze it. There may be data cleaning involved.
  • You can go from groups of events to understanding events to understanding fields to understanding normality/anomalies to generating reports. But the truth is, this is an iterative process. Each step tells you more about something else. (Un)fortunately, this presentation is linear.
  • Raw values, like raw text.
  • Make eventtypes for “session opened”, “session closed”, “linux initialized”. Tag them. Then mine out questions like “how long is the average session?”, “how much churn is there?”, etc.
  • Consider linecount as well.
  • Make eventtypes or tags for cron jobs, ntpd, dhclient. Then mine out questions like “who is running what jobs?”, “which are the most common?”
  • One of the most useful ways to see how your individual events relate to each other is to look for pauses in your events, as real physical events often happen in bursts. For example, there are bursts of log activity: when you shut down a computer; when you access a web page, which has many images; when a car factory robot detects the next car; when you turn on a printer and it connects to your computer; when you scan your security badge.
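This pause-based grouping is what Splunk’s `transaction maxpause` does for you. As a rough illustration of the principle (not Splunk’s implementation), here is a minimal Python sketch that groups epoch-second timestamps into bursts separated by a maximum pause:

```python
def group_bursts(timestamps, max_pause=2.0):
    """Group epoch-second timestamps into bursts: consecutive events
    separated by less than max_pause land in the same group, mimicking
    the idea behind `transaction maxpause=2s`."""
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] < max_pause:
            bursts[-1].append(t)   # continue the current burst
        else:
            bursts.append([t])     # start a new burst
    return bursts

# Three events within 2s form one burst; the event ~100s later stands alone.
print(group_bursts([0.0, 0.5, 1.2, 101.0]))
```

The lone trailing group is a “burst of one,” which, as discussed later, is itself an interesting anomaly candidate.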
  • Make transactions for sessions opening and closing. Find unclosed transactions. How often, how many, by whom?
  • No reason to limit correlations to a particular data source. Splunk can easily correlate them together in one search. The search isn’t correct in that the dedup is removing important consecutive events, but it was useful for showing small correlated events across sources.
  • If Facebook had eventtypes, you’d define any picture that has any of your family members but no co-workers as a ‘family’ pix that you could then have a virtual photo album for. Any pix with a family member outside the Bay Area is a “family vacation” pix. When you search your data, you’re essentially weeding out all unwanted events; the results of your search are events that share common characteristics, and you can give them a collective name or “event type”. The names of your event types are added as values into an eventtype field. This means that you can search for, and report on, these groups of events the same way you search for any field. The following example takes you through the steps to save a search as an eventtype and then search on that field. If you run frequent searches to investigate SSH and firewall activities, such as sshd logins or firewall denies, you can save these searches as an event type. Also, if you see error messages that are cryptic, you can save them as an event type with a more descriptive name.
  • Why? Reduce the number of fields you should focus on to those with the most value, for analysis and graphing.
  • A 1.0 means two fields always co-occur. For example, Component and Log_Level always co-occur in splunkd.log. You can filter out fields to make this table more manageable.
  • ----- Meeting Notes (9/4/12 11:49) -----give splunkd example output first to show log
  • This shows that before we know the component is SavedSplunker, the odds of a WARN Log_Level are 62.25%; afterwards, the odds are 100%. Before we know the component is loader, the odds of an INFO Log_Level are 33.15%; afterwards, 99.44%.
  • What are anomalies/outliers? The set of data points that are considerably different. Applications: network intrusion detection, fault detection, credit card fraud detection, telecommunication fraud detection. Build a profile of the “normal” behavior (patterns, stats) to detect anomalies. Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look.
  • Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look: … | eventstats perc99(spent) as bigspender | where spent > bigspender
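The `eventstats perc99` trick above flags events beyond the 99th percentile of a numeric field. A minimal Python sketch of the same idea, using a nearest-rank percentile (not Splunk’s exact algorithm):

```python
def perc99(values):
    # Nearest-rank 99th percentile: the value below which ~99% of points fall.
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

spent = list(range(100)) + [5000]          # one wildly slow request
big_spenders = [v for v in spent if v > perc99(spent)]
print(big_spenders)
```

Anything above the threshold is a candidate for an alert, exactly as in the `where spent > bigspender` clause.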
  • Very often you want to find anomalies/problems in your IT data, but you don’t know what to look for. Single value: the ‘port’ value is highly irregular. Many values: many values look different than others. Anomalous: many values were unexpected by context. Everything applies to transactions as well. Look for anomalies.
  • Identifies values in the data that are anomalous either by frequency of occurrence or number of standard deviations from the mean. Make searches to find these anomalous values and create alerts.
  • catNormFreq = the average frequency of non-anomalous values. isNum means all values of the field were numerical. Basically, we assume a normal distribution, but if we find that ends up causing too many values to be anomalous, we don’t use it.
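The normality assumption described in this note can be sketched in a few lines of Python. This is an illustration of the statistical idea, not the `anomalousvalue` command’s actual code: flag numeric values more than a few standard deviations from the mean.

```python
from statistics import mean, stdev

def anomalous_values(values, z_thresh=3.0):
    # Assume a roughly normal distribution and flag values more than
    # z_thresh standard deviations from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if s > 0 and abs(v - m) / s > z_thresh]

# 60 ordinary response times and one spike:
print(anomalous_values([10] * 30 + [11] * 30 + [500]))
```

As the note says, if this flags too many values, the normal-distribution assumption is wrong for that field and a frequency-based test is the better fallback.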
  • Earlier we looked for large clusters to get a broad understanding of the events. We grouped by content, format, and time.Now, just flip it. Make searches to find these anomalous values and create alerts.
  • Same for form (looking for unusual punctuation) or especially long pauses between events (10 seconds?). Make searches to find these anomalous values and create alerts.
  • These slow events are often important and indicate longer tasks.
  • Make eventtypes or tags for these slow, important events. Who runs them most? Are they a problem? Why is someone exporting, or backfilling their data? Make an alert when it happens.
  • Experimental search command that uses compression and a window of the N last events to see if a new event compresses well with past events, or if it looks unexpected. Make searches to find these anomalous values and create alerts.
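The compression idea can be sketched in a few lines of Python (an illustration of the principle, not the experimental command’s implementation): an event that resembles the recent window adds few compressed bytes; an unexpected one adds many.

```python
import zlib

def surprise(window_events, new_event):
    # Extra compressed bytes the new event costs, given a window of
    # recent events. Repetitive events compress away; novel ones don't.
    base = b"\n".join(e.encode() for e in window_events)
    return (len(zlib.compress(base + b"\n" + new_event.encode()))
            - len(zlib.compress(base)))

window = ["session opened for user root"] * 20
print(surprise(window, "session opened for user root"))   # low score
print(surprise(window, "kernel panic: unable to mount"))  # higher score
```

Events whose surprise score stands out are the ones worth alerting on.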
  • Make searches to find these anomalous values and create alerts.
  • One of the most obvious and important methods of discovering what your data is saying is to simply graph your data.Humans have a well-developed ability to analyze large amounts of data presented visually, detecting general patterns and trends, as well as outliers and unusual patterns.
  • What data points are outliers? What inferences would you make? They are radically different.
  • Limitations of statistical approaches: they usually test a single attribute; distributions aren’t known; for many dimensions, it’s hard to estimate the true distribution. Do not just rely on numerical summarization, or you won’t see what’s going on.
  • Same for transactions of events, and classes of events (eventtypes) and field-values (tags)
  • Eventually you’ll tweak out little nuggets of knowledge. Over time, what is the average duration users spend on my website by language or country, compared to last month? How does the time on the website correlate with the time of day, or browser? Does the max delay for each server vary over time by language? Same for transactions of events, and classes of events (eventtypes) and field-values (tags).
  • So reduce the number of dimensions down to 2 or 3 for visualization, and limit the data shown.
  • Heat map vs. a much more useful chart.
  • Discovery: Each step tells you more about everything else.
  • Predicting foo and getting better and better at it; towards the right edge you can see it’s predicting values that haven’t happened yet.

Data Mining with Splunk: Presentation Transcript

  • Data Mining and Exploration David Carasso, Office of CTO, Chief Mind
  • AGENDA: What is data mining? What’s the plan of attack? What type of events do I have? How do I mine fields? How do I detect anomalous events? Why do I need to visualize my data?
  • What is Data Mining? 3
  • Is this data mining? This is an orange. 4
  • What is Data Mining? Extracting implicit, previously unknown, and potentially useful information from data. 5
  • Better 6
  • Data Preparation, Understanding, Data Exploration, Data Mining 7
  • What’s the plan of attack? 8
  • Preparing the data
    You’ve been thrown data you aren’t familiar with…
    Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
    Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
    Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
    Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user root
    Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
    Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
    Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
    Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
    ....
    Eventtypes (closed sessions), Fields (pid), Transactions (open-close), Anomalies (unexpected address) 9
  • Is Understanding Linear? Event Groups, Events, Reports, Anomalies, Fields. No. 10
  • What type of events do I have? 11
  • Given Some Unknown Data
    Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
    Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
    Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
    Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user root
    Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
    Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
    Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration ...
    Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
    Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
    .... 12
  • Find Broad Categories of Events: Group Events by Content, Format, and Time 13
  • Group Events by Content
    Cluster events with similar values. Show 3 examples from each cluster, from the most common cluster to the least:
    …| cluster labelonly=t showcount=t | dedup 3 cluster_label sortby -cluster_count, cluster_label, _time 14
  • Events By Content
    count label _raw
    1339  3   Mar 7 11:05:01 willLaptop crond(pam_unix)[6785]: session opened for user root by…
    1339  3   Mar 7 11:10:01 willLaptop crond(pam_unix)[1769]: session opened for user root by…
    1339  3   Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session opened for user root by…
    1324  2   Mar 7 11:05:02 willLaptop crond(pam_unix)[6785]: session closed for user root
    1324  2   Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session closed for user root
    1324  2   Mar 7 11:10:02 willLaptop crond(pam_unix)[1769]: session closed for user root
     136  13  Mar 7 20:05:08 willLaptop kernel: SELinux: initialized (dev selinuxfs, type selinuxfs)…
     136  13  Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev usbfs, type usbfs), uses…
     136  13  Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev sysfs, type sysfs), uses…
    15
  • Group by $%#! Format
    Cluster events by first 7 punctuation chars:
    …| rex field=punct "(?<smallpunct>.{7})" | eventstats count by smallpunct | sort -count, smallpunct | dedup 3 smallpunct 16
  • Events by Format
    count smallpunct raw
    637 __::__( Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
    637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
    637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by…
    367 __::__: Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
    367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
    367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
     57 __::__[ Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2
     57 __::__[ Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum 10
     57 __::__[ Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s
    17
  • Group by Time
    Look for bursts of events
    • Turn on computer
    • Load a web page
    • Detect a speeding car
    • Print document
    • Scan security badge
    18
  • Group by Time Bursts
    … | transaction maxpause=2s | search eventcount>1
    Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session opened for user root by (uid=0)
    Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by (uid=0)
    Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
    Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
    Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
    Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
    Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
    Mar 10 16:45:01 willLaptop crond(pam_unix)[9553]: session opened for user root by (uid=0)
    Mar 10 16:45:02 willLaptop crond(pam_unix)[9553]: session closed for user root
    19
  • Multiple Sources (not really correct) 20
  • Now what? 1. ✓ group your data 2. tell splunk! 21
  • Telling Splunk (about your groups of events): Add eventtypes and tags. Huh? 22
  • SURPRISE TANGENT! What is an eventtype? 23
  • Eventtype: A dynamic “tag” added to events, if they would match the search that defines the eventtype. 24
  • Eventtype: Name: “closed_root”; Definition: “session closed” root. Event: … session closed for user root … => eventtype=closed_root 25
  • Create an Eventtype 26
  • Independent searches will return events tagged with previous eventtypes that help classify events. 27
  • Create reports on the classifications you’ve made. Ok, it wasn’t a tangent. 28
  • How do I mine fields? 29
  • Fields Correlation: Discover correlations to remove uninteresting fields and narrow in on promising reports. haiku 30
  • Fields Correlation Haiku: Discover patterns / in fields with a correlation: / co-occurring fields. indulgence 31
  • Splunkd.log Sample File
    09-05-2012 15:34:11.886 -0700 INFO ExecProcessor - Ran script: python /opt/splunk/etc/apps/...
    09-05-2012 15:34:02.467 -0700 ERROR TcpOutputProc - Cant find or illegal IP address or ...
    09-05-2012 15:32:03.397 -0700 INFO ProcessTracker - Process ran long; type=SplunkOptimize ...
    09-05-2012 15:30:20.016 -0700 WARN DispatchCommand - The system is approaching the maximum ...
    fascinating 32
  • Field Correlation
    … | correlate
    RowField   C    CN   Component Context L   ...
    C          1.00 1.00 0.00      0.00    1.00
    CN         1.00 1.00 0.00      0.00    1.00
    Component  0.00 0.00 1.00      0.06    0.00
    Context    0.00 0.00 0.06      1.00    0.00
    L          1.00 1.00 0.00      0.00    1.00
    Log_Level  0.00 0.00 1.00      0.06    0.00
    … 33
  • Field Associations
    Automatically deduce correlations and implications of field values:
    …| associate Log_Level Component 34
  • Field Association Summary
    Ref_Key   Ref_Value                Target_Key Support Uncond_Entropy Cond_Entropy Increase Top_Conditional_Value
    Component DatabaseDirectoryManager Log_Level 34.67%  1.182 0.000 1.182201 WARN (62.25% -> 100.00%)
    Component HotDBManager             Log_Level 38.25%  1.182 0.000 1.182201 INFO (33.15% -> 100.00%)
    Component SavedSplunker            Log_Level 394.31% 1.182 0.000 1.182201 WARN (62.25% -> 100.00%)
    Component databasePartitionPolicy  Log_Level 95.50%  1.182 0.417 0.765017 INFO (33.15% -> 91.57%)
    Component loader                   Log_Level 79.17%  1.182 0.050 1.131883 INFO (33.15% -> 99.44%)
    Component timeinvertedIndex        Log_Level 44.28%  1.182 0.000 1.182201 INFO (33.15% -> 100.00%)
    35
  • Top Fields by Fields
    Most common Log_Level by Component: ... | top Log_Level by Component
    Component                 Log_Level count percent
    AdminManager              WARN      1     100.000000
    DatabaseDirectoryManager  WARN      153   100.000000
    DateParserVerbose         WARN      262   100.000000
    DedupProcessor            ERROR     1     100.000000
    DeploymentClient          DEBUG     60    85.714286
    DeploymentClient          WARN      5     7.142857
    36
  • How do I detect anomalous events? 37
  • Types of Anomalies: Anomalies you know about. Anomalies you don’t know about. 38
  • Handling Known Anomalies. Easy. Define a search for the anomalous condition and make an alert to detect it.
    ip=10.* NOT domain=mycompany.com
    … | stats perc99(spent) -> 500ms. Alert on “spent>500” 39
  • Finding Unknown Anomalies. Look for abnormal: Single-Field Values, Multi-Field Values, Contexts, Visual Inspections… 40
  • Anomalies by Single Field Values: Identify anomalous values in a given field either by frequency of occurrence or number of standard deviations from the mean.
    … | anomalousvalue action=summary pthresh=0.02 | search isNum=YES 41
  • Anomalies by Single Field Values 42
  • Anomalous by Many Values: Look for small clusters – by content, format, and time – to find anomalies. For example…
    …| cluster …| sort cluster_count 43
  • Smallest Clusters by Content
    count label uri
    1     7     /img/skins/default/bolt.png
    1     37    /en-US/search/inspector?sid=1345075042.125&namespace=search
    1     45    /services/admin/summarization?count=10
    1     53    /services/pdfgen/is_available?viewId=index_status_health&...
    1     57    /static/splunkrc_cmds.xml
    44
  • Small Clusters: Bursts of One
    Find bursts of just a single event where a pause of 2 seconds occurred around it.
    … | transaction maxpause=2s | search eventcount = 1
    Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126…
    Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum…
    Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567… 45
  • Burst of One
    Same idea, different data source: splunk
    [11:58:08] "POST /services/search/jobs/export HTTP/1.1" 200 201630 …
    [11:12:51] "POST /services/search/jobs/export HTTP/1.1" 200 459441 …
    [10:00:58] "GET /servicesNS/nobody/SplunkDeploymentMonitor/backfill/… 46
  • Anomalous by Context
    Identify values not expected by the context of other events.
    … | anomalies field=file labelonly=true maxvalues=10 47
  • Anomalous by Context
    Unexpectedness file
    0.00 shelper
    0.16 shelper
    0.00 1345502591.356
    0.00 1345502591.356
    0.00 1345074401.191
    0.00 1345074031.153
    Unexpectedness time
    0.03 1345074328.186
    0.00 1345502591.356
    0.35 conf-dm_backfill
    0.00 1345074309.185
    0.00 1345502591.356
    48
  • Surprise Eventtype: Part Deux!
    Classified major categories of your data with eventtypes?
    -- just search for things that don’t match those eventtypes 49
  • 50
  • Once you can describe anomalous behavior as a search… 51
  • 52
  • Other mining commands
    • kmeans: Performs k-means clustering on selected fields.
    • outlier: Removes outlying numerical values.
    • af (analyze fields): Analyzes numerical fields for their ability to predict another discrete field.
    • fieldsummary: Generates summary information about fields.
    • shape: Produces a symbolic shape attribute describing the shape of a numeric multivalued field.
    53
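To make the first of these concrete, here is a toy one-dimensional k-means (plain Lloyd’s algorithm) in Python. The `kmeans` search command generalizes this to multiple fields; the data here is illustrative.

```python
def kmeans_1d(values, k=2, iters=20):
    # Lloyd's algorithm on one numeric field: repeatedly assign each
    # value to its nearest center, then move centers to cluster means.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups: fast responses (~10ms) and slow ones (~500ms).
print(kmeans_1d([9, 10, 11, 480, 500, 520], k=2))
```

Once the centers stabilize, the small, far-away clusters are the same anomaly candidates the `cluster` slides pointed at.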
  • Why do I need to visualize my data? 54
  • Data Mining by Visualization: Visualization can capture nuances in the data that numerical or linguistic summaries cannot easily capture. 55
  • These data points are radically different. *Source: Anscombe’s Quartet (Anscombe 1973) 56
  • Why visualize? Because they all have the exact same • average (7.50) • standard deviation (2.03) • least-squares fit (3 + 0.5x). Do not just rely on numerical summarization. 57
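The numbers on this slide come from Anscombe’s quartet, and they can be checked directly. A small standard-library Python script reproducing the shared statistics of all four datasets:

```python
from statistics import mean, stdev

# Anscombe's quartet (Anscombe 1973): four datasets with nearly identical
# summary statistics but wildly different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def slope(x, y):
    # Least-squares slope: cov(x, y) / var(x).
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

for x, y in quartet:
    print(round(mean(y), 2), round(stdev(y), 2), round(slope(x, y), 2))
```

Every row prints the same mean (7.5), standard deviation (2.03), and slope (0.5); only a plot reveals how different the four datasets really are.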
  • But I already have charts! You don’t graph enough. Data Exploration: Don’t decide ahead of time what graphs you want. Regularly do out-of-the-box scenarios with graphs. 58
  • Data Exploration Variations:
    • Subsets of Events (paying customers vs. lookers)
    • Fields by Fields (including eventtypes and tags)
    • Ignored fields
    • Min/max/avg/count
    • Compare to other time windows
    • Transactions
    59
  • Visual Arrangement: Sorting data and changing scales (linear/log), min/max can make a huge difference when looking at the same data. 60
  • Visual Considerations Pick representations that make obvious the distinctions you need to care about. 61
  • Summary 62
  • Summary• Discovery is an iterative process.• Group events by content, format, and time, and define classifications with eventtypes and tags• Focus on promising fields with correlations• Discover unknown anomalies with small clusters.• Visualize your data, from a dozen angles. 63
  • But wait! 64
  • More to come: Predictive Analytics… | forecast foo 65
  • The End Mine the Gap..,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`....,`......_.,`...,`...,`...,`...,`...,`...,`...,`...,`....._.....___..|.|...__._..._.__.,`..._.__.,`..___...__.,`...__.|.|.../.__|.|.|../._`.|.|._.....|._..../._...../././.|.|..|.(__..|.|.|.(_|.|.|.|_).|...|.|.|.|.|.(_).|...V..V./..|_|...___|.|_|..__,_|.|..__/....|_|.|_|..___/...._/_/...(_)..,`...,`...,`...,`..|_|.,`...,`...,`...,`...,`...,`...,`..... Golf clapping at #datamining.,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`...,`... 66