Field Extractions: Making Regex Your Buddy


This presentation was given by Michael Wilde, Splunk Ninja at Splunk's Worldwide User Conference 2011. A demonstration accompanied this presentation. Link is forthcoming.


Transcript

  • 1. Making Reg[Ee]x Your Buddy. August 15, 2011. (?i)(mi(chael|ke) wilde), Splunk Ninja. Thursday, August 18, 11
  • 2. Hi, I'm Michael Wilde
      • You may know me from:
      Splunk Worldwide Users' Conference. © Copyright Splunk 2011
  • 3. What is RegEx: "Finite Automata"
      • Regular expressions were invented in the 1950s by mathematician Stephen Cole Kleene
      • Implemented by "ed" and "grep" creator Ken Thompson in 1973
      • Pattern matching language for text processing
      • Has slightly different implementations (Perl, POSIX)
      • Way cryptic at first sight
  • 4. Why should you care
      • Field extraction is a requirement for reporting
      • Index-time filtering & routing
      • You'll seem smart
      • It will be useful beyond Splunk
      • You might score with the (ladies|dudes) at (MakersFaire|ComiCon).
  • 5. Thinking Regex
  • 6. Thinking Regex
      • Log events are a great place to start; they have structure
      • Don't overthink it. The pattern is there waiting to be discovered
      • Don't be lazy and use wildcards too much
      • Learn to love "NOT" regexes: \S+ \D+ \W+ [^,]+
  • 7. (no text on slide)
  • 8. Be nice to your RegEx engine
      • MS-DOS taught us to be laaaaaaaaaaaaaaaaazy with *.*
      • A regex engine matches character by character, and then does backtracking.
      • Match in as few steps as possible
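The advice on slides 6 and 8 can be illustrated with a small sketch. It uses Python's `re` module rather than Splunk's regex engine, and the sample log line is made up for illustration. The point is only that a greedy wildcard overshoots and has to backtrack, while a negated character class stops at the delimiter.

```python
import re

# Hypothetical log line (not from the presentation).
line = "user=alice,action=login,status=ok"

# Lazy habit: the greedy wildcard runs to the end of the line and then
# backtracks to the LAST comma, so it captures too much.
wildcard = re.search(r"user=(.*),", line).group(1)

# "NOT" regex: [^,]+ means "anything that is not a comma", so the match
# stops at the first delimiter and no backtracking is needed.
negated = re.search(r"user=([^,]+)", line).group(1)

print(wildcard)  # alice,action=login
print(negated)   # alice
```

The negated class also states the intent directly: the field ends where the comma begins.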
  • 9. Regexes in Splunk
      • Search Language: "rex", "erex", "regex"
      • Indexing: Filtering data (in|out), line breaking, timestamp extraction
      • Field Extraction
  • 10. IFX
      • Splunk has a built-in "interactive field extractor"
      • It can be useful. Give it samples of data, and it will attempt to learn a regex and persist a single field
      • It limits the number of events it displays in its viewer.
      • You might not see your search results when using it? Huh?
  • 11. What if we could use that "intelligent" stuff IFX was doing, but in the search language?
  • 12. Meet "erex"
      • Allows you to give it examples, but it works on your search results
      • Allows you to give it counterexamples of stuff you don't want to match on
      • Builds you a proper rex command
  • 13. ...there's an app for that. Right?
  • 14. Field Extractor App
      • Imagine you could use your mouse, highlight fields, name them, persist them, go home early and never write regex.
      • David Carasso's Field Extractor app is like a "workbench for field extraction"
      • Download it from SplunkBase
  • 15. Searching with regex
  • 16. The | regex search command
      • Did you know Splunk crushes all terms to lower case?
      • If you need to look for specific patterns or even words and respect the case the original events are in, use | regex
      • index=splunktv | regex _raw="(MP3|M4A)"  <-- notice this is a case-sensitive pattern match.
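To see what "case-sensitive pattern match" means for the slide's `(MP3|M4A)` pattern, here is a minimal sketch in Python's `re` module (not Splunk itself). The three event strings are invented for illustration.

```python
import re

# Hypothetical raw events (not from the presentation).
events = [
    "played track01.MP3",
    "played track02.mp3",
    "played track03.M4A",
]

# The slide's pattern, matched case-sensitively.
strict = re.compile(r"(MP3|M4A)")
# The same pattern with case ignored, for comparison.
loose = re.compile(r"(MP3|M4A)", re.IGNORECASE)

case_sensitive_hits = [e for e in events if strict.search(e)]
case_insensitive_hits = [e for e in events if loose.search(e)]

print(case_sensitive_hits)    # the upper-case events only
print(case_insensitive_hits)  # all three events
```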
  • 17. What about good ole Rex?
      • Search-time field extractions via your own regexes, in the search language
      • Name your fields
      • Reuse everyone else's work!
  • 18. A few more tricks for you
  • 19. Host extraction irritates me
  • 20. Regex in host extraction
      • Splunk will attempt to do the right thing. Log source will likely make it hard for Splunk, and you'll blame Splunk
      • Props.conf & transforms.conf are needed to properly extract hostnames in some cases (F5 Big-IP and HP networking gear)
      • Use default settings in props.conf and use your own settings as well
  • 21. Priority boarding in props.conf
      [source::...a...]
      TRANSFORMS-ahosts = ahostextraction
      priority = 1
      [source::...z...]
      TRANSFORMS-zhosts = zhostextraction
      priority = 99
      What if the source we were matching against had the word "arizona" in it? It will match both, right? Use "priority" to control matching. 99 is higher than 1, so 99 is a higher priority. Yeah, I know... weird.
  • 22. Basic Training Complete! Let's do something more difficult
  • 23. Splunk is so smart, except when it's not
      <policy id="3">Finjan HTTPS policy</policy>
      <cp id="5" name="Active Content" display_name="Active Content"/>
      <group id="5002" cp_id="5" type="0">Full profile - Binary Behavior</group>
      <item id="28015">Format error in CRL lastUpdate field</item>
      <item id="3265747">*.served.com/*</item>
      <rule_comment id="2" name="Block certificate validation errors">&lt;![CDATA[Block HTTPS content without a valid certificate]]&gt;</rule_comment>
      AUTO-KV pulled the "id" field out of every event. Yay!!!
  • 24. "id" is not the field name. Look closer, Agent Starling
      <policy id="3">Finjan HTTPS policy</policy>
      <cp id="5" name="Active Content" display_name="Active Content"/>
      <group id="5002" cp_id="5" type="0">Full profile - Binary Behavior</group>
      <item id="28015">Format error in CRL lastUpdate field</item>
      <rule_comment id="2" name="Block certificate validation errors">&lt;![CDATA[Block HTTPS content without a valid certificate]]&gt;</rule_comment>
      We can educate Splunk on dynamically pulling the KEY and VALUE with...
  • 25. Dynamic Key Value Extraction ...but tailored for our needs
      REGEX for the "KEY" is <([^=]+)=
      Less than, followed by (anything that is "not an equal sign", greedy match), followed by an equal sign
      Keep going, dude!
      REGEX for the "VALUE" is "([^"]+)">
      A quote, (followed by anything that is not a quote, greedy match), followed by a quote, followed by a greater than sign
      Sample text: <policy id="3">   <cp id="5"   <item id="28015">
  • 26. Persist your sweet dynamic KV patterns (props.conf & transforms.conf required)
      Create an entry in props.conf like this:
      [m86_dynamic_kv]
      REPORT-m86fields = mym86kv
      Create an entry in transforms.conf like this:
      [mym86kv]
      REGEX = <([^=]+)="([^"]+)">
      FORMAT = $1::$2
      Sample event ($1 is the key, $2 is the value):
      <policy id="3">Finjan HTTPS policy</policy>
  • 27. Dang it! It wasn't perfect. Some of our events don't finish their XML tag right after a quote
      Create an entry in props.conf like this:
      [m86_dynamic_kv]
      REPORT-m86fields = mym86kv
      Create an entry in transforms.conf like this:
      [mym86kv]
      REGEX = <([^=]+)="([^"]+)[^>]+>
      FORMAT = $1::$2
      Sample event (clipped on the slide):
      <rule_comment id="690" name="Log everythin
      Image files">&lt;![CDATA[Logs all content passin
      the system except for ......
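The difference between the two REGEX values on slides 26 and 27 can be checked outside Splunk. The sketch below runs both patterns with Python's `re` module over three of the sample lines from slide 23. The `extract` helper is a hypothetical stand-in for what `FORMAT = $1::$2` does, not Splunk's implementation.

```python
import re

# Sample lines taken from slide 23.
lines = [
    '<policy id="3">Finjan HTTPS policy</policy>',
    '<cp id="5" name="Active Content" display_name="Active Content"/>',
    '<item id="28015">Format error in CRL lastUpdate field</item>',
]

first_try = re.compile(r'<([^=]+)="([^"]+)">')       # REGEX from slide 26
second_try = re.compile(r'<([^=]+)="([^"]+)[^>]+>')  # REGEX from slide 27


def extract(pattern, line):
    """Hypothetical helper: map capture group 1 (key) to group 2 (value)."""
    return {key: value for key, value in pattern.findall(line)}


for line in lines:
    print(extract(first_try, line), extract(second_try, line))
```

The first pattern returns nothing for the `<cp ...>` line because the tag does not close right after the first quoted value. The second pattern tolerates the extra attributes and still yields `cp id` = `5`.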
  • 28. Think you're good? Try extracting the "service" field
      2011/07/21 19:27:22.071 [(ninja-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ninja-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      Your job is to create a multi-valued field, as the "service" field exists multiple times in each event
  • 29. Look for the obvious patterns
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      Your brain will tell you to look for "anything after the first comma" after that left bracket and before the second comma
  • 30. ...and your brain was wrong.
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      This is NOT a "service". Dang... what are we gonna do now?
  • 31. What is common with "services"
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      They're all alphanumeric or "word" characters: 0-9A-Za-z_
  • 32. But what about the preceding text
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      Left bracket followed by some stuff, followed by a comma... but it's not consistent. Sometimes a "(" left paren is in there.
  • 33. This is a better match
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      Regex (backslash escapes, lost in the transcript, restored): \[[(\-a-zA-Z0-9]+,([a-zA-Z]+),
      Say the matching pattern out loud. It will help.
      Left bracket, followed by anything in this character list (greedy). Followed by a comma, and then create a capturing group of text that matches upper or lower case roman alphabet, greedy (as many times as possible). End capturing group, then followed by a comma.
  • 34. Can't be too hard to extend it, right?
      2011/07/21 19:27:22.071 [(ela4-fe96,opensocial,/makeRequest,2011/07/21
      19:27:21.978)[ela4-be04,auth,Auth2Service.recoverSubject]] [] [Auth2Service]
      recoverSubject(V1.21.47,OSM:1t7Dg201000:i:
      1311276436:1d00a2fc1f9addd936af12ed5c430a169c362af8,null,shindig,
      172.17.207.243,)=[Principal[3],[OSM:1t7Dg201000:i:
      1311276439:20d1d0b474927a301376d70f2ad5949a2241e271,false,1h]] in 1ms
      Regex (backslash escapes, lost in the transcript, restored): \[[(\-a-zA-Z0-9]+,([a-zA-Z]+),[^\[]+\[[(\-a-zA-Z0-9]+,([a-zA-Z]+),
      Left bracket, followed by anything in this character list (greedy). Followed by a comma, and then create a capturing group of text that matches upper or lower case roman alphabet, greedy (as many times as possible). End capturing group, then followed by a comma. Followed by anything that is NOT a left bracket, followed by.....
  • 35. Sad Trombone. This one has four services
      2011/07/21 19:27:27.596 [(ninja4-fe29,genie,/handle,131292312,2011/07/21
      19:27:27.310)[ninja4-
      be716,lmt,PbContentService.write<tetherAccountData;default>][ninja4-
      be05,tether,TetherAccountService.bindAccount][ninja4-
      be393,auth,Auth2Service.upgradeSubject]] [] [Auth2Service]
      upgradeSubject(V1.21.49,"INT",[LIM:131292312:s:
      1311276361:b8f677d957eb3f7b9622247b72374c791720bc17,true],
      {internalAppName=twitter-sync},"tether",null)=[Principal[2],[INT:
      131292312/twitter-sync:
      1311276447:df9dd0175bd2e6107c2dfae36dfd9a9dc11f0631,false,20y]] in 15ms
  • 36. Remember "rex"? He devours data
      But you can make "rex" very hungry and control how much lunch he eats. By default, he only gets "one helping of meat"
  • 37. Using max_match with rex
      You limit or expand the number of times it runs
      rex max_match=20 "\[[(\-a-zA-Z0-9]+,(?<service>[a-zA-Z]+),"
      Instead of that last regex that matched "two" services, let's just match one, and tell rex to repeat our pattern matching
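The "match one, then repeat" idea can be tried outside Splunk. This sketch applies the single-service pattern, with its escapes written out explicitly, to the four-service event from slide 35, using Python's `re` module. `extract_services` and its `max_match` argument are hypothetical names chosen to mirror the rex option; they are not Splunk APIs. Python would spell the named group `(?P<service>...)`, so the sketch uses a plain capturing group.

```python
import re
from itertools import islice

# Event from slide 35, with the slide's line wraps joined back together.
event = (
    "2011/07/21 19:27:27.596 [(ninja4-fe29,genie,/handle,131292312,2011/07/21 "
    "19:27:27.310)[ninja4-be716,lmt,PbContentService.write<tetherAccountData;default>]"
    "[ninja4-be05,tether,TetherAccountService.bindAccount]"
    "[ninja4-be393,auth,Auth2Service.upgradeSubject]] [] [Auth2Service] "
    'upgradeSubject(V1.21.49,"INT",[LIM:131292312:s:1311276361:'
    "b8f677d957eb3f7b9622247b72374c791720bc17,true],"
    '{internalAppName=twitter-sync},"tether",null)=[Principal[2],'
    "[INT:131292312/twitter-sync:1311276447:"
    "df9dd0175bd2e6107c2dfae36dfd9a9dc11f0631,false,20y]] in 15ms"
)

# Left bracket, a run of "(", "-", letters or digits, a comma,
# then the service name (letters only), then another comma.
SERVICE = re.compile(r"\[[(\-a-zA-Z0-9]+,([a-zA-Z]+),")


def extract_services(text, max_match=1):
    """Return up to max_match service names, scanning left to right."""
    return [m.group(1) for m in islice(SERVICE.finditer(text), max_match)]


print(extract_services(event))                # one helping: ['genie']
print(extract_services(event, max_match=20))  # ['genie', 'lmt', 'tether', 'auth']
```

A limit of 20 is simply a ceiling. The event holds only four matching bracket groups, so four values come back.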
  • 38. You can persist this in config files (props.conf & transforms.conf required)
      Create an entry in props.conf like this:
      [ninjasocial]
      REPORT-ninjafields = myepicregex
      Create an entry in transforms.conf like this:
      [myepicregex]
      REGEX = \[[(\-a-zA-Z0-9]+,(?<service>[a-zA-Z]+),
      MV_ADD = TRUE
  • 39. And now for something difficult: gaming logs - Team Fortress
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
  • 40. I need the data: gaming logs - Team Fortress
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
  • 41. Who's who? How do we know who did what to whom?
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
  • 42. Field labels on the event
      actor, actor_id, actor_team, actor_type
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
      actee, actee_id, actee_type, actee_team
  • 43. Didn't we see this slide before? How do we know who did what to whom?
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
  • 44. See that pattern? Remember "max_match"?
      L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed
      "MoreGun<56><BOT><Blue>" with "flamethrower" (attacker_position
      "-2677 2177 -127") (victim_position "-2555 2323 -127")
  • 45. See that pattern? Remember "max_match"?
      "The Administrator<61><BOT><Red>"
      "MoreGun<56><BOT><Blue>"
      Using rex / mv_add, let's capture it into some temporary "multi-value" fields
  • 46. "Temporary" Multi-Value Fields
      actor_name_z   The Administrator,MoreGun
      actor_id_z     61,56
      actor_type_z   BOT,BOT
      actor_team_z   Red,Blue
      Using rex / mv_add, let's capture it into some temporary "multi-value" fields
  • 47. Evaluate & Transform with "mvindex"
      Multi-value fields have a "position value" in the array
      mvindex        0                   1
      actor_name_z   The Administrator   MoreGun
      actor_id_z     61                  56
      actor_type_z   BOT                 BOT
      actor_team_z   Red                 Blue
  • 48. It's time for our fields to split up!
      Multi-value fields have a "position value" in the array
      | eval actor_name = mvindex(actor_name_z,0) | eval actee_name = mvindex(actor_name_z,1)
      actor_name = The Administrator
      actee_name = MoreGun
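The capture-then-split flow on slides 45 to 48 can be sketched in Python. The slides do not show the regex used for the temporary fields, so the `PLAYER` pattern below is an assumption written for this illustration. List indexing stands in for `mvindex`.

```python
import re

# Event from slide 39, joined onto one logical line.
event = (
    'L 08/02/2011 - 11:46:05: "The Administrator<61><BOT><Red>" killed '
    '"MoreGun<56><BOT><Blue>" with "flamethrower" '
    '(attacker_position "-2677 2177 -127") (victim_position "-2555 2323 -127")'
)

# Assumed pattern (not shown on the slides): "name<id><type><team>"
PLAYER = re.compile(r'"([^"<]+)<(\d+)><([^>]+)><([^>]+)>"')

# Every match contributes one value to each temporary multi-value field.
players = PLAYER.findall(event)
actor_name_z, actor_id_z, actor_type_z, actor_team_z = (
    list(column) for column in zip(*players)
)

# Position 0 is the actor, position 1 is the actee (like mvindex(field, n)).
actor_name = actor_name_z[0]
actee_name = actor_name_z[1]

print(actor_name_z)            # ['The Administrator', 'MoreGun']
print(actor_name, actee_name)  # The Administrator MoreGun
```

The same indexing would give `actor_id`, `actee_id`, and the type and team fields from the other three lists.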
  • 49. Resources
      • regexlib.com
      • regular-expressions.info
      • gskinner.com/RegExr
      • Reggy / RegExhibit
      • RegexBuddy (JGSoft.com)
  • 50. Questions, just ask! Michael Wilde, Splunk Ninja, ninja@splunk.com