SplunkLive! Dallas Nov 2012 - Metro PCS

Speaker Notes

  • What is MetroPCS, what does it entail? What does the playing field look like?
  • Lessons learned: share best practices with the audience and things they need to look out for.
  • At search time, Splunk looks up the current partner tariffs from an external RDBMS, integrating this data and calculating actual charges based on the call duration in the CDR. Splunk also calculates the best rate and telecommunications partner, which makes it easy to determine whether lowest-cost partners are being used in all cases. Splunk performs the external tariff lookup and calculation at a rate of over 10,000 CDRs/second on a single commodity x86 server.
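The tariff lookup and best-rate calculation can be sketched outside Splunk as well. A minimal illustration in Python, where the `tariffs` table, the partner names, and the dollars-per-minute rate units are all hypothetical stand-ins for the external RDBMS data:

```python
# Hypothetical sketch of the per-CDR tariff calculation described above.
# Partner names and rates are illustrative, not MetroPCS's actual data.

tariffs = {  # partner -> rate in dollars per minute (assumed units)
    "PartnerA": 0.020,
    "PartnerB": 0.015,
}

def call_cost(partner: str, duration_seconds: int) -> float:
    """Actual charge for one CDR: partner rate times call duration."""
    return tariffs[partner] * (duration_seconds / 60)

def best_partner(duration_seconds: int) -> str:
    """Lowest-cost partner for a call of the given duration."""
    return min(tariffs, key=lambda p: call_cost(p, duration_seconds))
```

Comparing each CDR's actual partner against `best_partner` is what makes it easy to see whether lowest-cost partners are being used in all cases.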
  • Could also use Splunk to enforce SLAs with the MSSP
  • Every data session has 2 attributes: the home subscriber carrier (“SUB”) and the current service carrier (“SVC”), and these correspond to 2 categories of eventtypes. If, for example, we have a data session that is a Sprint roamer using our network, the accounting record for this will have eventtypes SPRINT_SUB and MPCS_SVC. This allows us to very easily discriminate different categories of traffic without having to directly examine any IP addresses or SIDs, while maintaining this (always-changing) data in a single place, OUTSIDE of the searches.
  • The naming convention is important because the only variable a designer may change that gets passed to a script from an automated search is the search’s name, so all data must be encoded into the search name. In our case, any search that calls our “sendSNMPtrap.pl” script and starts with the string “SNMP: ” will derive the following fields from the search name:
      * <PRODUCT> is the name of the service that has the problem (e.g. PGW)
      * <HOST> is the name/IP address of the server that has the problem
      * <SEVERITY> is one of INFO, MINOR, MAJOR, CRITICAL
      * <DESCRIPTION> is any plain text but fewer than 100 characters total; DO NOT USE THE COLON CHARACTER or it will be truncated after the 1st one!
    Obviously, each search must be limited to the one particular <HOST> that is listed in the <HOST> portion of the search’s name, and a clone of each search must be scheduled for each host.
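The parsing side of this convention can be sketched in Python (the actual sendSNMPtrap.pl script is Perl and is not shown in the slides, so this is an illustration, not the real implementation). The example search name is the one from the slide deck:

```python
def parse_search_name(name: str) -> dict:
    """Recover fields encoded as 'SNMP: <PRODUCT>: <HOST>: <SEVERITY>: <DESCRIPTION>'."""
    parts = name.split(": ")
    # The description is simply the 5th token; any text after a further
    # colon would land in parts[5:] and be dropped, which is why the
    # convention forbids colons in the description.
    return {"PRODUCT": parts[1], "HOST": parts[2],
            "SEVERITY": parts[3], "DESCRIPTION": parts[4]}

fields = parse_search_name(
    "SNMP: PGW: blade3a: MAJOR: BWS Oracle Sequence Number exhaustion")
```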
  • The reason it went undetected for so long was that the process writing the data put different lead characters in each event because it was writing to 2 different places. I had to ignore the first portion of each event and compare only the last portion. Here is the transaction command I used, which creates a field called “lastChars” consisting of everything beyond the first “[” character and shows me any duplicate events that come from different sources:
      index=xxx | rex field=_raw ".*?\[(?<lastChars>.*)$" | transaction lastChars maxpause=0 maxspan=0 keepevicted=true | where mvcount(source) > 1
    In the latter use, we have many operations that should have a “start” event and an “end” event, and sometimes we are missing one or the other in our data. Here is a search that will show these events:
      index=xxx | transaction SomeField maxpause=1s | where linecount > 2 | stats count
    Don’t forget that when you run any search you can click on the icon of horizontal lines just underneath the “X results in the last Y <timeframe>” text and see the individual events that made up the search results data.
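The same dedup idea, sketched in Python for illustration (the event payloads and source file names here are invented): strip everything up to the first “[” and flag payloads seen under more than one source.

```python
import re
from collections import defaultdict

def find_cross_source_duplicates(events):
    """events: iterable of (source, raw_event) pairs.

    Mirrors the rex/transaction search above: keep only what follows the
    first "[" and report payloads that appear under more than one source.
    """
    sources_by_payload = defaultdict(set)
    for source, raw in events:
        m = re.match(r".*?\[(?P<lastChars>.*)$", raw)  # Python uses (?P<name>)
        if m:
            sources_by_payload[m.group("lastChars")].add(source)
    return {p for p, srcs in sources_by_payload.items() if len(srcs) > 1}

events = [
    ("a.log", "A1 [call=42 dur=10"),  # same payload from two sources: duplicate
    ("b.log", "B7 [call=42 dur=10"),
    ("a.log", "A2 [call=43 dur=99"),
]
```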
  • We like to monitor our data retention and were able to use the _internal index to automate this. We run the following search every day for the previous day, and it shows every index that has had data purged and the timespan of the remaining data. We run this from within a script which parses the “retentionDays” values returned, compares them with our retention targets for each index, and sends alerts when we cross particular thresholds.
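The arithmetic behind the alerting script can be sketched in Python; the index name and retention target below are hypothetical, and only the retentionDays formula itself comes from the search.

```python
import time

def retention_days(oldest_time_epoch, now=None):
    """(now - oldestTime) / (60*60*24), as in the eval of the retention search."""
    if now is None:
        now = time.time()
    return (now - oldest_time_epoch) / (60 * 60 * 24)

def check_retention(index_name, oldest_time_epoch, target_days, now=None):
    """Return an alert string when retention falls below the target, else None."""
    days = retention_days(oldest_time_epoch, now)
    if days < target_days:
        return f"ALERT: index {index_name} retains only {days:.1f} days (target {target_days})"
    return None
```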

Transcript

  • 1. Copyright © 2012 Splunk, Inc.
    Gregg Woodcock, Sr. Member of Technical Staff
  • 2. About MetroPCS
    • Provider of unlimited wireless communications service for a flat rate with no annual contract
    • Fifth largest facilities-based wireless carrier in the United States
    • Approximately 9 million subscribers
  • 3. Gregg Woodcock
    • Extensive telecom experience creating and supporting Nortel’s wireless products for 18+ years
    • My team designs/creates/deploys or evaluates/integrates software and hardware to support the resiliency and expansion of our private mobile telephone network
    • Anything with a data connection flows through my group (GPS, Push-to-Talk, browser, VOIP)
    • Previously with Sabre, Mobile Metrics and Nortel
  • 4. The Challenge at MetroPCS
    • Launching New Products and Services: speeding application de-bugging means bringing new products to market faster
    • Tariff Justification/Optimization: Splunk statistical analysis uses RDBMS lookup to calculate cost per call
    • Call Detail Record Visibility: Splunk’s ability to ingest any format without parsers or adapters speeds deployment and time to value
    • Detecting Abuse: reports and dashboards highlight possible abusers, a key indicator of Terms of Service abusers
  • 5. Speeding De-bugging, Speeding Time to Market
    • Launching a new handset every month!
    • We’ve gone from hours to minutes for troubleshooting issues
    • Self-service, secure access
    • No more bouncing issues from group to group
    • Same types of bugs happen each time, so we’ve automated searches and alerts for these known patterns
    “Splunk helps us uncover most bugs before we go into production, improves user experience and gets us to market much more quickly.”
  • 6. Speed of Implementation
    • Splunk was up and running on commodity hardware in 2 weeks
    • Incumbent product would have required an 8-month services engagement
  • 7. Unexpected Benefits
    • Open source de-bugging (everyone can play: teach men to fish)!
    • Overall understanding of baseline system health
    • Fast subpoena compliance for Law Enforcement inquiries – CDR analysis
    “Better able to close the door on all kinds of leaks!”
  • 8. Lowest Cost Call Routing → Revenue Optimization
    • High-priced tariffs with no visibility
    • Optimal call routes difficult to track or understand
    • Manual mediation of tariff information was a 3+ month exercise, often still without desired results, resulting in higher-than-necessary fees
    • Splunk helps us understand actual partner costs by looking at partner tariffs from an external RDBMS and calculating actual charges based on call duration
    • We now have the ability to optimize call routing
    • Lowest-cost routing has a direct impact on the bottom line, saving hundreds of thousands of dollars
  • 9. Daily Revenue Optimization Dashboard
  • 10. Carrier and Destination Savings Dashboard
  • 11. Interesting Splunk Story
    When an earthquake hit Trinidad and Tobago we knew about it before the news broke (“What’s up with Trinidad and Tobago?”). When the incoming lines went down, our searches looking for bad Answer/Seize Ratios (ASR) showed this route’s ASR drop through the floor.
  • 12. Growing Business Means Growing Data Volumes
    • Business growth is good, but growth without understanding can be crippling
    • CDR data alone is >1TB/day
    • Correlating with other network and external data is more challenging still
    • Splunk allows us to link the rich data in CDRs with external RDBMSs, systems and networks
    • New visibility has highlighted new business opportunity and exposed abuse
  • 13. Detecting EULA Violations – Abuse Dashboard
  • 14. Android Smartphone Launch
    • Data usage skyrockets overnight, beyond expectations
      – What handset types were being used?
      – Where and when is usage happening?
      – Track usage broken down day-to-day, hour-by-hour
    • Available data source: Radius accounting records, which contain many key details:
      – MDN, Realm, BSID (SID/NID/Cell/Sector encoded), PCF IP Address (32-bit hex), Service Option (radio technology, e.g. 1xRTT, EVDO), BytesIn, BytesOut
  • 15. Lookup CSV
    • Radius user = MDN@Realm
    • Realm is a 1-to-1 mapping to device type
    • Derive fields to show handset type:
      LOOKUP-realm2device_lookup = realm2device_lookup realm OUTPUT UImake, UImodel, UIOS
  • 16. Field Extraction
    • Convert the BSID field into 4 other derived fields:
      [bsid_to_sid_nid_cell_sector]
      SOURCE_KEY = BSID
      REGEX = ^(?<SID>.{4})(?<NID>.{4})(?<Cell>.{3})(?<Sector>.{1})
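The same fixed-width split, sketched in Python for illustration. Note that Python named groups use (?P<name>) where Splunk uses (?<name>); the sample BSID value below is invented.

```python
import re

# Fixed-width split of a 12-character BSID: 4 chars of SID, 4 of NID,
# 3 of Cell, 1 of Sector -- the same widths as the REGEX above.
BSID_RE = re.compile(r"^(?P<SID>.{4})(?P<NID>.{4})(?P<Cell>.{3})(?P<Sector>.{1})")

fields = BSID_RE.match("00F100A212AB").groupdict()
```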
  • 17. Human-readable Market Name
    • Convert the SID number into a human-readable market name (e.g. DFW):
      LOOKUP-SID2market_lookup = SID2market_lookup SID OUTPUT MKTfromSID
      LOOKUP-service_option_lookup = service_option_lookup SO OUTPUT SO_
  • 18. My First Macro!
    • Convert a 32-bit hex PCF IP Address into dotted-quad format:
      [32bit2dottedquad(2)]
      args = IP32bit, nameIPdottedquad
      definition = eval ip0=floor($IP32bit$/16777216) | eval ipx=$IP32bit$%16777216 | eval ip1=floor(ipx/65536) | eval ipx=ipx%65536 | eval ip2=floor(ipx/256) | eval ip3=ipx%256 | strcat ip0 "." ip1 "." ip2 "." ip3 $nameIPdottedquad$
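The macro's floor/modulo arithmetic, restated in Python as a check; the input is assumed to have already been parsed into a 32-bit integer.

```python
def ip32_to_dotted_quad(ip32: int) -> str:
    """Same successive floor/modulo steps as the Splunk macro above."""
    ip0 = ip32 // 16777216      # top octet: floor(x / 256^3)
    ipx = ip32 % 16777216
    ip1 = ipx // 65536          # second octet
    ipx = ipx % 65536
    ip2 = ipx // 256            # third octet
    ip3 = ipx % 256             # bottom octet
    return f"{ip0}.{ip1}.{ip2}.{ip3}"
```

For example, 0xC0A80001 comes out as "192.168.0.1".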
  • 19. Event Typing for Carrier Discrimination
    • Every data session has 2 attributes corresponding with 2 event types:
      – Subscriber’s home carrier (“SUB”)
      – Data session’s service carrier (“SVC”)
    • Example: SPRINT_SUB MPCS_SVC
  • 20. Create Any Usage Report
    • Count users, sessions, bytes-in, bytes-out, bytes-total
    • Break down by user (MDN), carrier (MPCS, Sprint, etc.), handset (UIMake, UIModel), OS (UIOS: Android, RIM, BREW, Windows), region/market (MKTfromSID), PCF (PCFIP_32bitAddr or PCFIP_dottedQuad), cell (Cell), sector (Cell, Sector), radio technology (ServiceOption=1xRTT/EVDO/LTE), or any combination thereof
  • 21. Splunk Saves Vendor’s Launch Date!
    • Desire to accommodate a vendor’s beta software that did not have the SNMP alerting portion updated
    • This service generated exploitable error logs that we were already “Splunking”
    • Enter SplunkBase
    • Modified a “send SNMP” script we found and created scheduled searches that automatically raise alarms in our NOC
  • 22. Search for a Naming Convention
      [SNMP: PGW: blade3a: MAJOR: BWS Oracle Sequence Number exhaustion]
      action.email = 1
      action.email.sendresults = 1
      action.email.to = Gwoodcock@metropcs.com
      action.script = 1
      action.script.filename = sendSNMPtrap.pl
      counttype = number of events
  • 23. Search for a Naming Convention (cont’d)
      cron_schedule = 0 * * * *
      description = If this sequence number "tops out" the PGW will fail all transactions!
      dispatch.earliest_time = -1h@h
      dispatch.latest_time = now
      enableSched = 1
      quantity = 0
      relation = greater than
      search = <REDACTED>
  • 24. The Space-Saving Transaction Command
    • A large amount of duplicates went undetected
    • Needed to identify duplicate (but not identical) or unmatched events
    • Created a field called “lastChars” that helped me determine duplicates from different sources:
      index=xxx | rex field=_raw ".*?\[(?<lastChars>.*)$" | transaction lastChars maxpause=0 maxspan=0 keepevicted=true | where mvcount(source) > 1
  • 25. _internal Index Automation
    • Automated data retention monitoring
    • Reveals every index that had data purged
    • Threshold-based alerts
      index=_internal sourcetype=splunkd bucketmover "will attempt to freeze" | rex field=_raw "/splunk_data/[^/]*/(?<indexname>[^/]*)/db_(?<newestTime>[^_]*)_(?<oldestTime>[^_]*)_.*" | dedup indexname | eval retentionDays=(now()-oldestTime)/(60*60*24) | stats values(retentionDays) as retentionDays by indexname
  • 26. Lessons Learned
    • Do disk partitioning for indices at the virtual/software/conf layer, not the LUN/hardware layer
    • Always have way more disk than you think you’ll need
    • Always have more indexers than you think you’ll need
    • PUT THE DEPLOYMENT SERVER IN FIRST (some pain to wedge in later, and you WILL have to do this eventually)
    • Keep up with upgrades (many bug fixes)
    • Convert discoveries into scheduled searches (don’t have the same “surprise” twice)
  • 27. Copyright © 2012 Splunk, Inc. Thank you