How to Meta-Sumo
Using Logs for Agile Monitoring of
Production Services

Christian Beedgen
christian@sumologic.com
@raychaser
Who Am I?

     Co-Founder & CTO, Sumo Logic since 2010
      – Cloud-based log management and analytics
      – Applications, operations, security
     Server guy, Chief Architect, ArcSight, 2001 – 2009
      – Major SIEM player in the enterprise space
      – Log management for security and compliance




Yo GlueCon!!@!!211




What are Logs?

     The murmurings and whispers of your infrastructure
      – Devices & Services (e.g. Security, Email, Authentication)
      – Applications (your code, or third-party applications)
     Written to disk by applications
      – Used during development by developers
      – And then often ignored in production
     Yes, Logs are Big Data
      – There are a lot of logs (Volume)
      – A large Variety of formats, plus it’s real-time (Velocity)
     Free-form text, usually one message per line
      – Scrolls by in your terminal
      – Makes you feel like you’re in the Matrix


Application Logging

     Just do it*
      – Use what’s common in your language (Log4J, …)
      – If in doubt, log it - you might need it later, and disk is cheap
     The obvious stuff
      – Always log a precise timestamp
      – Add a log level for filtering
     Building a distributed system?
      – Always add the process/service/module name
      – Add the host ID if you can
     Log the context
      – When processing a request, remember who it is from
      – Within the scope of the request, always log the context
     Bring it all together in a single place
      – Open Source options
      – Commercial options
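
As a rough illustration of the checklist above, here is a minimal sketch using Python's standard logging module (the slide itself only names Log4J). The service name "search-service", the field names host/module/user, and the current_user variable are assumptions made up for this example, not anything from the talk.

```python
import contextvars
import io
import logging
import socket

# Hypothetical context holder: who the current request is from.
current_user = contextvars.ContextVar("current_user", default="-")


class ContextFilter(logging.Filter):
    """Attach the host ID and the request context to every record."""

    def filter(self, record):
        record.host = socket.gethostname()
        record.user = current_user.get()
        return True


def build_logger(name, stream):
    """Return a logger that writes one message per line to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        # timestamp with UTC offset, level, host, module name, context, message
        fmt="%(asctime)s %(levelname)s host=%(host)s module=%(name)s "
            "user=%(user)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S%z",
    ))
    handler.addFilter(ContextFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, user):
    """Remember who the request is from, then log within that scope."""
    token = current_user.set(user)
    try:
        logger.info("search started")
    finally:
        current_user.reset(token)


out = io.StringIO()
log = build_logger("search-service", out)
handle_request(log, "alice@example.com")
line = out.getvalue()
```

The filter keeps the context out of the call sites: code inside the request scope just calls logger.info and the user is added for it. Note that this timestamp format has one-second resolution; a finer one would need a custom formatter.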

Just Do It*

     Actually, please do think first
      – Do not log passwords
      – Do not log credit card numbers
      – Do not log anything that’s PII
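
One possible safety net, sketched with Python's standard logging module: a filter that masks obvious secrets before a message is written. The two patterns below are illustrative assumptions and will miss plenty; a filter like this is a backstop, and not logging the value in the first place remains the better fix.

```python
import io
import logging
import re

# Assumed patterns for illustration; a real deployment needs a reviewed list.
PASSWORD = re.compile(r"(password=)\S+", re.IGNORECASE)
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


class RedactingFilter(logging.Filter):
    """Mask passwords and card-like numbers in the final message text."""

    def filter(self, record):
        message = record.getMessage()
        message = PASSWORD.sub(r"\1[REDACTED]", message)
        message = CARD.sub("[REDACTED-CARD]", message)
        # Replace the message and drop the args so nothing is re-interpolated.
        record.msg, record.args = message, ()
        return True


out = io.StringIO()
handler = logging.StreamHandler(out)
handler.addFilter(RedactingFilter())
logger = logging.getLogger("billing")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning("login failed password=%s card %s declined",
               "hunter2", "4111 1111 1111 1111")
redacted = out.getvalue()
```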




Or Else…




 http://www.slideshare.net/christoferhoff/shit-my-cloud-evangelist-saysjust-not-to-my-cso




Log Management




This is the Meta part

      Our system to collect and analyze Logs is actually a
      distributed, cloud-based infrastructure that itself
      generates a ton of logs
      We are running a second, smaller instance that
      receives all the logs from the Prod system – we call
      this our Shadow system
      A majority of the management and monitoring of the
      Prod system is done via the Shadow system, which
      we are madly in love with



My team








          No, really. And we also have a Panda.


Example




      Timestamp with time zone!
      Log level
      Host ID & module name (process/service)
      Code location or class
      Authentication context
      Key-value pairs
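
The log line shown on this slide did not survive in the transcript, so as a stand-in for the last item, here is a small hypothetical helper (kv is not from the talk) that renders key-value pairs at the end of a message. The field names are made up for the example.

```python
def kv(**fields):
    """Render keyword arguments as space-separated key=value pairs."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        # Quote values containing whitespace so each pair stays one token.
        if any(ch.isspace() for ch in text):
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


message = "search finished " + kv(session="4F2A", duration_ms=1250, status="ok")
```

Keeping the pairs in a fixed key=value shape makes them easy to pull out later, even though the rest of the line is free-form text.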

What about structure?

      Should you use JSON instead, say?
       – Hey, it’s really up to you – machines love it
       – But do humans read the logs as well?
      Modern tools extract structure if there is structure
       – Even if the structure doesn’t conform to a protocol
       – So you don’t strictly have to use a structured format
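
To make the point about extracting structure concrete, here is a minimal sketch (not how any particular tool does it) that accepts either a JSON line or a free-form line with embedded key=value pairs. The function name and the sample line are assumptions for illustration.

```python
import json
import re

# key=value, where the value is either a double-quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def extract_fields(line):
    """Return a dict of fields from a JSON line or from key=value pairs."""
    stripped = line.strip()
    if stripped.startswith("{"):
        try:
            return json.loads(stripped)
        except json.JSONDecodeError:
            pass  # not valid JSON after all; fall back to key=value pairs
    fields = {}
    for key, value in PAIR.findall(stripped):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


free_form = ('2013-05-22 10:15:03 -0700 INFO search finished '
             'session=4F2A duration_ms=1250 reason="disk full"')
as_json = '{"level": "INFO", "duration_ms": 1250}'
```

Values extracted from free-form text come back as strings, while JSON keeps its types; that difference is part of the trade-off the slide describes.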




Sumo Pet Trick #1 – Basic Troubleshooting

      All errors in the application have an error instance ID




       UI always tries very hard to display the ID
        – User can easily copy/paste the error ID into a ticket



      Find AP2V3-HZICT-BP6UO in the logs!
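
A minimal sketch of the idea, assuming the ID format can be inferred from the single sample above (three dash-separated groups of five uppercase letters and digits); the real scheme is not described in the slides. The log lines are invented for the example.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


def new_error_id(groups=3, size=5):
    """Generate a random ID shaped like AP2V3-HZICT-BP6UO (assumed format)."""
    return "-".join(
        "".join(secrets.choice(ALPHABET) for _ in range(size))
        for _ in range(groups)
    )


def find_error(lines, error_id):
    """Return every log line that mentions the given error instance ID."""
    return [line for line in lines if error_id in line]


error_id = new_error_id()
logs = [
    "2013-05-22 10:15:03 -0700 INFO module=api request ok",
    f"2013-05-22 10:15:04 -0700 ERROR module=api errorId={error_id} save failed",
]
hits = find_error(logs, error_id)
```

The same ID goes into the log line and into the error shown to the user, so a ticket containing the ID leads straight to the matching lines.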




Sumo Pet Trick #2 – Everything breaks

      Distributed systems → lots of different components
       – There will be fail
       – The morning coffee routine…
      There is beauty in numbers
       – Pull out all the logs that contain the term “error” (or “fail”,
         or “exception”, or all of the above)
       – Then count by process/service/module, or host ID
      Even the most basic aggregation will create
      actionable insight
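
The aggregation described above can be sketched in a few lines of Python. The module= field and the sample lines are assumptions for this example; the query shown in the original slides is not part of the transcript.

```python
import re
from collections import Counter

ERROR_TERMS = re.compile(r"error|fail|exception", re.IGNORECASE)
MODULE = re.compile(r"\bmodule=(\S+)")


def count_errors_by_module(lines):
    """Count lines mentioning error/fail/exception, grouped by module."""
    counts = Counter()
    for line in lines:
        if ERROR_TERMS.search(line):
            match = MODULE.search(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


logs = [
    "2013-05-22 10:15:03 -0700 ERROR host=node-1 module=index flush failed",
    "2013-05-22 10:15:04 -0700 INFO host=node-2 module=api request ok",
    "2013-05-22 10:15:05 -0700 WARN host=node-1 module=index IOException while reading",
    "2013-05-22 10:15:06 -0700 ERROR host=node-3 module=api timeout",
]
counts = count_errors_by_module(logs)
```

Swapping the MODULE pattern for one that matches host= gives the count by host ID instead.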







      Schedule this!



      Old school can rock this too!




Sumo Pet Trick #3 – API Choke Points

      What is my API doing?
       – Which APIs are being called?
       – Average execution times per API
       – Who’s calling?





      Number of calls by API





      Average execution time by API





      Number of API calls by user
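
The charts for these three views are not in the transcript. As a sketch of the underlying aggregations, assuming each API call is logged with hypothetical api=, user= and duration_ms= fields in that order:

```python
import re
from collections import Counter, defaultdict

CALL = re.compile(r"\bapi=(\S+) user=(\S+) duration_ms=(\d+)")


def api_stats(lines):
    """Return (calls by API, average duration by API, calls by user)."""
    calls_by_api = Counter()
    calls_by_user = Counter()
    total_ms = defaultdict(int)
    for line in lines:
        match = CALL.search(line)
        if not match:
            continue
        api, user, duration = match.group(1), match.group(2), int(match.group(3))
        calls_by_api[api] += 1
        calls_by_user[user] += 1
        total_ms[api] += duration
    avg_ms = {api: total_ms[api] / calls_by_api[api] for api in calls_by_api}
    return calls_by_api, avg_ms, calls_by_user


logs = [
    "2013-05-22 10:15:03 -0700 INFO module=api api=/search user=alice duration_ms=120",
    "2013-05-22 10:15:04 -0700 INFO module=api api=/search user=bob duration_ms=80",
    "2013-05-22 10:15:05 -0700 INFO module=api api=/login user=alice duration_ms=30",
]
calls_by_api, avg_ms, calls_by_user = api_stats(logs)
```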




Sumo Pet Trick #4 – Which searches are running

      Log a session ID on start and stop
       – When there are long-running operations
       – When more than one node participates in generating the result
      For each session in the time range, did you get both
      the start and the stop message?
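
The start/stop check above can be sketched as a set difference. The session= and event= fields are assumptions for this example, not the format used in the talk.

```python
import re

EVENT = re.compile(r"\bsession=(\S+) event=(start|stop)\b")


def running_sessions(lines):
    """Return session IDs with a start message but no stop message."""
    started, stopped = set(), set()
    for line in lines:
        match = EVENT.search(line)
        if match:
            session, event = match.group(1), match.group(2)
            (started if event == "start" else stopped).add(session)
    return started - stopped


logs = [
    "2013-05-22 10:14:59 -0700 INFO module=search session=s0 event=stop",
    "2013-05-22 10:15:03 -0700 INFO module=search session=s1 event=start",
    "2013-05-22 10:15:04 -0700 INFO module=search session=s2 event=start",
    "2013-05-22 10:15:09 -0700 INFO module=search session=s1 event=stop",
]
still_running = running_sessions(logs)
```

A session that started before the time range (s0 here) shows up only with its stop message and is ignored. A session reported as running may also simply have failed without logging a stop, which is worth a look either way.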




https://www.sumologic.com/free-trial/



     Christian Beedgen
     christian@sumologic.com
     @raychaser



Thank you.
 Ok, and now we start drinking…




http://iam.cat/cat-love-beer

