SlideShare a Scribd company logo
Logging Makes
Perfect
Itamar	
  Syn-­‐Hershko	
  
h1p://code972.com	
  
@synhershko	
  
Real-world monitoring
and visualizations
Me?
•  Itamar	
  Syn-­‐Hershko	
  /	
  @synhershko	
  
•  Blog	
  at	
  h1p://code972.com	
  
•  Long	
  @me	
  OSS	
  contributor	
  
•  Lucene.NET	
  PMC	
  and	
  lead	
  commi1er	
  
•  Elas@csearch	
  consul@ng	
  partner	
  
•  Building	
  soKware	
  to	
  catch	
  fraudsters	
  @	
  Forter	
  
$15K
Purchase Amount Washington (high income)
Billing Neighborhood
Platinum+
Credit Card Type
Kibuve, Rwanda
IP Location
Virginia (Low income)
Shipping Neighborhood
< $100
Past Purchase Amounts
Ivy League School
Billing Address
Expedited Shipping
Shipping type
Sean Smith, Macbook Air x 3
DECLINE
D
$2,450
Purchase Amount
Washington (high income)
Billing Neighborhood
Platinum+
Credit Card Type
Kibuve, Rwanda
IP Location
Virginia (Low income)
Shipping Neighborhood
< $100
Past Purchase Amounts
Ivy League School
Billing Address
(202) 222-3850
Phone
Sean is an officer, serving overseas in Rwanda.
His shipping address matches the pattern of diplomat
mailboxes for US embassies and his billing address is
still linked to his old college accommodation.
How did we know?	
  
	
  
Sean Smith, Macbook Air x 3
BEHAVIORAL DATA
REAL-TIME APPROVED
Different systems, different SLAs
	
  
	
  
Performance	
   Data	
  Loss	
   Business	
  Logic	
  
TX	
  Processing	
   Low	
  Latency	
   Nope	
   Best	
  Effort	
  
Stream	
  Processing	
   High	
  Throughput	
   Best	
  Effort	
   Best	
  Effort	
  
Batch	
  Processing	
   High	
  Volume	
   Nope	
   Reconcilia@on	
  
Forter’s TX Processing API
Forter	
  
Transac@on	
  details	
  
	
  
Decision	
  
(approve	
  /	
  decline)	
  
milliseconds	
  
Achieving Low Latencies
For	
  comparison:	
  
•  A	
  few	
  network	
  hops	
  between	
  AWS	
  availability	
  zones	
  (~1ms)	
  
•  JVM	
  young	
  garbage	
  collec@on	
  of	
  a	
  few	
  hundred	
  MBs	
  (~20ms)	
  
•  Read	
  by	
  primary	
  key	
  from	
  SSD	
  storage	
  (~30ms)	
  
•  US	
  West	
  Coast	
  to	
  East	
  Coast	
  network	
  latency	
  (~100ms)	
  
•  JVM	
  full	
  garbage	
  collec@on	
  a	
  few	
  hundred	
  MBs	
  
(~500ms-­‐1000ms)	
  
Distributed Systems
Our stack
	
  
nginx	
  (lua)	
  
	
  
expressjs	
  (nodejs)	
  
	
  
Storm	
  (java)	
  
	
  
	
  
Stability	
  
Code	
  
Changes	
  
Monitoring and Alerting
What do we monitor?
•  Latencies	
  
– Benchmark	
  tools	
  are	
  useless	
  here	
  
– Always	
  prefer	
  percen@les,	
  measure	
  the	
  long	
  tail	
  
– Dis@nguishing	
  real	
  traffic	
  from	
  synthe@c	
  requests	
  
	
  
•  Availability	
  
– No	
  non-­‐200	
  HTTP	
  status	
  codes	
  returned	
  
•  High	
  /	
  suspicious	
  metrics	
  
•  Errors,	
  excep@ons	
  
•  Wrong	
  processing	
  (data	
  loss?)	
  
Detec@on	
  
filter	
  &	
  
route	
  
alert	
  
diagnos@cs	
  
•  Pingdom	
  probes	
  -­‐	
  low	
  coverage,	
  reliable	
  
•  CloudWatch,	
  collectd	
  -­‐	
  fast,	
  no	
  root	
  cause	
  
•  Internal	
  probes	
  -­‐	
  be1er	
  coverage,	
  false	
  
alarms	
  
•  libbeat	
  and	
  logstash	
  for	
  monitoring	
  3rd	
  party	
  
applica@ons	
  –	
  some@mes	
  noisy,	
  false	
  alarms	
  
•  App	
  events	
  (excep@ons)	
  -­‐	
  too	
  noisy,	
  root	
  
cause	
  
Redundancy of detection sources
….	
  is	
  important!	
  
•  Different	
  concerns	
  
•  Varying	
  accuracy	
  and	
  verbosity	
  
•  High	
  availability	
  
•  Correla@on	
  
Automatic Latency Reporting: NodeJS
georg	
  (named	
  aKer	
  Georg	
  Friedrich	
  Bernhard	
  Riemann)	
  is	
  a	
  
nodejs	
  library	
  that	
  acts	
  as	
  an	
  in-­‐process	
  riemann	
  agent.	
  The	
  
library	
  can	
  send	
  nodejs	
  specific	
  and	
  applica@on	
  specific	
  
metrics	
  to	
  riemann.	
  
	
  
h1ps://github.com/forter/georg	
  
	
  
Apache Storm Basics
Automatic Latency Reporting: Apache Storm
	
  
	
  
	
  
	
  
	
  
	
  
	
  
h1ps://github.com/forter/riemann-­‐storm-­‐monitor	
  
	
  
The ELK Stack
Queue	
  
(Redis)	
  
“Shippers”	
  
“Indexer”	
  
The ELK Stack + Riemann
Queue	
  
(Redis)	
  
“Shippers”	
  
•  Detect	
  failures	
  
•  Expect	
  different	
  latencies	
  from	
  different	
  
components,	
  dynamically	
  
•  Tolerate	
  spikes	
  
•  High	
  /	
  low	
  watermarks	
  
•  Reshape	
  event	
  streams	
  (split,	
  merge,	
  …)	
  
Done	
  via	
  Riemann	
  –	
  an	
  event	
  stream	
  processor	
  
Riemann
; Listen on the local interface over TCP (5555), UDP (5555) & websockets
; (5556)
(let [host "127.0.0.1"]
(tcp-server {:host host})
(udp-server {:host host})
(ws-server {:host host}))
; Expire old events from the index every 5 seconds.
(periodically-expire 5)
(let [index (index)]
; Inbound events will be passed to these streams:
(streams
(default :ttl 60
; Index all events immediately.
index
; Log expired events.
(expired
(fn [event] (info "expired" event))))))
Filter tests using a state machine
(tagged "apisRegression"
(pagerduty-test-dispatch "1234567892ed295d91"))
(defn pagerduty-test-dispatch
[key]
(let [pd (pagerduty key)]
(changed-state {:init "passed"}
(where (state "passed") (:resolve pd))
(where (state "failed") (:trigger pd)))))
Filter tests using a state machine
Re-open manually resolved alert
Re-open manually resolved alert
(tagged "apisRegression"
(pagerduty-test-dispatch "1234567892ed295d91"))
(defn pagerduty-test-dispatch
[key]
(let [pd (pagerduty key)]
(sdo
(changed-state {:init "passed"}
(where (state "passed") (:resolve pd)))
(where (state "failed")
(by [:host :service]
(throttle 1 60 (:trigger pd)))))))
Riemann Events
Riemann’s Index
Key	
  (host	
  +	
  service)	
   Event	
   TTL	
  
10.0.0.1-­‐redisfree	
   {"metric":"5"}	
   60	
  
10.0.0.1-­‐probe1	
   {"state":"failed"}	
   300	
  
10.0.0.2-­‐probe1	
   {"state":"passed”}	
   300	
  
Riemann Ghosts
•  “Expired	
  events”	
  
•  Use	
  case:	
  Detect	
  dead	
  services	
  
•  Use	
  case:	
  Cron	
  stopped	
  working	
  (e.g.	
  backup)	
  	
  
Riemann Overloaded
key	
  (IP	
  +	
  cookie)	
   event	
   TTL	
  
199.25.1.1-­‐1234	
   {"state":"loaded"}	
   300	
  
199.25.2.1-­‐4567	
   {"state":"downloaded"}	
   300	
  
199.25.3.1-­‐8901	
  
	
  
{"state":"loaded"}	
  
	
  
300	
  
Something failed, now what?
•  Always	
  prefer	
  auto-­‐healing	
  
– Humans	
  are	
  slow	
  
– But	
  careful	
  from	
  a	
  domino	
  effect	
  
•  Fail	
  fast	
  and	
  automa@c	
  failover	
  
– Code	
  throws	
  excep@on,	
  logs	
  it	
  and	
  moves	
  forward	
  
– Fatal	
  excep@ons	
  will	
  cause	
  the	
  process	
  to	
  restart	
  (upstart)	
  
– Auto	
  scaling	
  groups	
  keep	
  enough	
  machines	
  up	
  
– ELB	
  will	
  phase	
  out	
  dead	
  stacks	
  
– HTTP	
  fencing	
  (commercial	
  products)	
  
Alert
•  When	
  human	
  interven@on	
  is	
  required	
  
•  Urgent:	
  immediate	
  ac@on	
  
– PagerDuty	
  (phone	
  calls,	
  SMS,	
  push	
  no@fica@ons)	
  
•  Low	
  urgency:	
  events	
  are	
  logged	
  and	
  
reviewed	
  the	
  morning	
  aKer	
  
– PagerDuty	
  (Phone	
  push	
  no@fica@ons)	
  
– Slack	
  
– Email	
  
Diagnostics
The ELK stack
For	
  real-­‐@me	
  analy@cs	
  and	
  dashboards	
  
The ELK Stack + Riemann
Queue	
  
(Redis)	
  
“Shippers”	
  
Analyzing sleep patterns J
*	
  Times	
  are	
  Israel	
  Day	
  Time	
  
Storm topology visualization & timing
Storm timelines
Alerting via anomaly detection
To conclude…
•  Fail	
  fast	
  
– With	
  back-­‐off	
  
•  Log	
  everything	
  
•  Aler@ng	
  via	
  state	
  machines	
  
•  By	
  host,	
  service,	
  @me	
  
•  Alerts	
  are	
  useless	
  without	
  proper	
  debug	
  tools	
  
•  Try	
  to	
  self-­‐heal	
  when	
  possible.	
  
Thank	
  you	
  
Ques>ons?	
  
Itamar	
  Syn-­‐Hershko	
  
h1p://code972.com	
  
@synhershko	
  /	
  @ForterTech	
  
	
  

More Related Content

What's hot

“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud
Amazon Web Services
 
Thoughts about DNS for DDoS
Thoughts about DNS for DDoSThoughts about DNS for DDoS
Thoughts about DNS for DDoS
APNIC
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
Karthik Deivasigamani
 
WAMP as a platform for composite SOA applications and its implementation on Lua
WAMP as a platform for composite SOA applications and its implementation on LuaWAMP as a platform for composite SOA applications and its implementation on Lua
WAMP as a platform for composite SOA applications and its implementation on Lua
Konstantin Burkalev
 
Secure360 - Attack All the Layers! Again!
Secure360 - Attack All the Layers! Again!Secure360 - Attack All the Layers! Again!
Secure360 - Attack All the Layers! Again!
Scott Sutherland
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
Steven Wu
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & KafkaBack-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Akara Sucharitakul
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Monal Daxini
 
Lateral Movement: How attackers quietly traverse your Network
Lateral Movement: How attackers quietly traverse your NetworkLateral Movement: How attackers quietly traverse your Network
Lateral Movement: How attackers quietly traverse your Network
EC-Council
 

What's hot (10)

“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud
 
Thoughts about DNS for DDoS
Thoughts about DNS for DDoSThoughts about DNS for DDoS
Thoughts about DNS for DDoS
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
 
WAMP as a platform for composite SOA applications and its implementation on Lua
WAMP as a platform for composite SOA applications and its implementation on LuaWAMP as a platform for composite SOA applications and its implementation on Lua
WAMP as a platform for composite SOA applications and its implementation on Lua
 
Secure360 - Attack All the Layers! Again!
Secure360 - Attack All the Layers! Again!Secure360 - Attack All the Layers! Again!
Secure360 - Attack All the Layers! Again!
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & KafkaBack-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
 
Lateral Movement: How attackers quietly traverse your Network
Lateral Movement: How attackers quietly traverse your NetworkLateral Movement: How attackers quietly traverse your Network
Lateral Movement: How attackers quietly traverse your Network
 

Similar to Logging makes perfect - Riemann, Elasticsearch and friends

Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation Strategies
Sagi Brody
 
Distributed monitoring
Distributed monitoringDistributed monitoring
Distributed monitoringLeon Torres
 
The rice and fail of an IoT solution
The rice and fail of an IoT solutionThe rice and fail of an IoT solution
The rice and fail of an IoT solution
Radu Vunvulea
 
Network Monitoring Basics
Network Monitoring BasicsNetwork Monitoring Basics
Network Monitoring Basics
Rob Dunn
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
confluent
 
Monitoring patterns for mitigating technical risk
Monitoring patterns for  mitigating technical riskMonitoring patterns for  mitigating technical risk
Monitoring patterns for mitigating technical risk
Itai Frenkel
 
Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation Strategies
Logan Best
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
Stéphane Maldini
 
Ingesting data streams with a serverless infrastructure
Ingesting data streams with a serverless infrastructureIngesting data streams with a serverless infrastructure
Ingesting data streams with a serverless infrastructure
Gil Colunga
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
amesar0
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
John Adams
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
Roger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
Peter Bakas
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb
Priyanka Aash
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
KafkaZone
 
lecture5.pptx
lecture5.pptxlecture5.pptx
lecture5.pptx
Llobarro2
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
Xin Wang
 

Similar to Logging makes perfect - Riemann, Elasticsearch and friends (20)

Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation Strategies
 
Distributed monitoring
Distributed monitoringDistributed monitoring
Distributed monitoring
 
The rice and fail of an IoT solution
The rice and fail of an IoT solutionThe rice and fail of an IoT solution
The rice and fail of an IoT solution
 
Network Monitoring Basics
Network Monitoring BasicsNetwork Monitoring Basics
Network Monitoring Basics
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Monitoring patterns for mitigating technical risk
Monitoring patterns for  mitigating technical riskMonitoring patterns for  mitigating technical risk
Monitoring patterns for mitigating technical risk
 
Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation Strategies
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
 
Ingesting data streams with a serverless infrastructure
Ingesting data streams with a serverless infrastructureIngesting data streams with a serverless infrastructure
Ingesting data streams with a serverless infrastructure
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
lecture5.pptx
lecture5.pptxlecture5.pptx
lecture5.pptx
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 

Recently uploaded

Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
abdulrafaychaudhry
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
vrstrong314
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
Google
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 

Recently uploaded (20)

Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 

Logging makes perfect - Riemann, Elasticsearch and friends

  • 1. Logging Makes Perfect Itamar  Syn-­‐Hershko   h1p://code972.com   @synhershko   Real-world monitoring and visualizations
  • 2. Me? •  Itamar  Syn-­‐Hershko  /  @synhershko   •  Blog  at  h1p://code972.com   •  Long  @me  OSS  contributor   •  Lucene.NET  PMC  and  lead  commi1er   •  Elas@csearch  consul@ng  partner   •  Building  soKware  to  catch  fraudsters  @  Forter  
  • 3.
  • 4.
  • 5. $15K Purchase Amount Washington (high income) Billing Neighborhood Platinum+ Credit Card Type Kibuve, Rwanda IP Location Virginia (Low income) Shipping Neighborhood < $100 Past Purchase Amounts Ivy League School Billing Address Expedited Shipping Shipping type Sean Smith, Macbook Air x 3 DECLINE D
  • 6. $2,450 Purchase Amount Washington (high income) Billing Neighborhood Platinum+ Credit Card Type Kibuve, Rwanda IP Location Virginia (Low income) Shipping Neighborhood < $100 Past Purchase Amounts Ivy League School Billing Address (202) 222-3850 Phone Sean is an officer, serving overseas in Rwanda. His shipping address matches the pattern of diplomat mailboxes for US embassies and his billing address is still linked to his old college accommodation. How did we know?     Sean Smith, Macbook Air x 3 BEHAVIORAL DATA REAL-TIME APPROVED
  • 7. Different systems, different SLAs     Performance   Data  Loss   Business  Logic   TX  Processing   Low  Latency   Nope   Best  Effort   Stream  Processing   High  Throughput   Best  Effort   Best  Effort   Batch  Processing   High  Volume   Nope   Reconcilia@on  
  • 8. Forter’s TX Processing API Forter   Transac@on  details     Decision   (approve  /  decline)   milliseconds  
  • 9. Achieving Low Latencies For  comparison:   •  A  few  network  hops  between  AWS  availability  zones  (~1ms)   •  JVM  young  garbage  collec@on  of  a  few  hundred  MBs  (~20ms)   •  Read  by  primary  key  from  SSD  storage  (~30ms)   •  US  West  Coast  to  East  Coast  network  latency  (~100ms)   •  JVM  full  garbage  collec@on  a  few  hundred  MBs   (~500ms-­‐1000ms)  
  • 11. Our stack   nginx  (lua)     expressjs  (nodejs)     Storm  (java)       Stability   Code   Changes  
  • 13.
  • 14. What do we monitor? •  Latencies   – Benchmark  tools  are  useless  here   – Always  prefer  percen@les,  measure  the  long  tail   – Dis@nguishing  real  traffic  from  synthe@c  requests     •  Availability   – No  non-­‐200  HTTP  status  codes  returned   •  High  /  suspicious  metrics   •  Errors,  excep@ons   •  Wrong  processing  (data  loss?)  
  • 19. •  Pingdom  probes  -­‐  low  coverage,  reliable   •  CloudWatch,  collectd  -­‐  fast,  no  root  cause   •  Internal  probes  -­‐  be1er  coverage,  false   alarms   •  libbeat  and  logstash  for  monitoring  3rd  party   applica@ons  –  some@mes  noisy,  false  alarms   •  App  events  (excep@ons)  -­‐  too  noisy,  root   cause  
  • 20. Redundancy of detection sources ….  is  important!   •  Different  concerns   •  Varying  accuracy  and  verbosity   •  High  availability   •  Correla@on  
  • 21. Automatic Latency Reporting: NodeJS georg  (named  aKer  Georg  Friedrich  Bernhard  Riemann)  is  a   nodejs  library  that  acts  as  an  in-­‐process  riemann  agent.  The   library  can  send  nodejs  specific  and  applica@on  specific   metrics  to  riemann.     h1ps://github.com/forter/georg    
  • 22.
  • 24. Automatic Latency Reporting: Apache Storm               h1ps://github.com/forter/riemann-­‐storm-­‐monitor    
  • 25. The ELK Stack Queue   (Redis)   “Shippers”   “Indexer”  
  • 26. The ELK Stack + Riemann Queue   (Redis)   “Shippers”  
  • 27. •  Detect  failures   •  Expect  different  latencies  from  different   components,  dynamically   •  Tolerate  spikes   •  High  /  low  watermarks   •  Reshape  event  streams  (split,  merge,  …)   Done  via  Riemann  –  an  event  stream  processor  
  • 29. ; Listen on the local interface over TCP (5555), UDP (5555) & websockets ; (5556) (let [host "127.0.0.1"] (tcp-server {:host host}) (udp-server {:host host}) (ws-server {:host host})) ; Expire old events from the index every 5 seconds. (periodically-expire 5) (let [index (index)] ; Inbound events will be passed to these streams: (streams (default :ttl 60 ; Index all events immediately. index ; Log expired events. (expired (fn [event] (info "expired" event))))))
  • 30. Filter tests using a state machine
  • 31. (tagged "apisRegression" (pagerduty-test-dispatch "1234567892ed295d91")) (defn pagerduty-test-dispatch [key] (let [pd (pagerduty key)] (changed-state {:init "passed"} (where (state "passed") (:resolve pd)) (where (state "failed") (:trigger pd))))) Filter tests using a state machine
  • 33. Re-open manually resolved alert (tagged "apisRegression" (pagerduty-test-dispatch "1234567892ed295d91")) (defn pagerduty-test-dispatch [key] (let [pd (pagerduty key)] (sdo (changed-state {:init "passed"} (where (state "passed") (:resolve pd))) (where (state "failed") (by [:host :service] (throttle 1 60 (:trigger pd)))))))
  • 35. Riemann’s Index Key  (host  +  service)   Event   TTL   10.0.0.1-­‐redisfree   {"metric":"5"}   60   10.0.0.1-­‐probe1   {"state":"failed"}   300   10.0.0.2-­‐probe1   {"state":"passed”}   300  
  • 36. Riemann Ghosts •  “Expired  events”   •  Use  case:  Detect  dead  services   •  Use  case:  Cron  stopped  working  (e.g.  backup)    
  • 37. Riemann Overloaded key  (IP  +  cookie)   event   TTL   199.25.1.1-­‐1234   {"state":"loaded"}   300   199.25.2.1-­‐4567   {"state":"downloaded"}   300   199.25.3.1-­‐8901     {"state":"loaded"}     300  
  • 38. Something failed, now what? •  Always  prefer  auto-­‐healing   – Humans  are  slow   – But  careful  from  a  domino  effect   •  Fail  fast  and  automa@c  failover   – Code  throws  excep@on,  logs  it  and  moves  forward   – Fatal  excep@ons  will  cause  the  process  to  restart  (upstart)   – Auto  scaling  groups  keep  enough  machines  up   – ELB  will  phase  out  dead  stacks   – HTTP  fencing  (commercial  products)  
  • 39. Alert •  When  human  interven@on  is  required   •  Urgent:  immediate  ac@on   – PagerDuty  (phone  calls,  SMS,  push  no@fica@ons)   •  Low  urgency:  events  are  logged  and   reviewed  the  morning  aKer   – PagerDuty  (Phone  push  no@fica@ons)   – Slack   – Email  
  • 41. The ELK stack For  real-­‐@me  analy@cs  and  dashboards  
  • 42. The ELK Stack + Riemann Queue   (Redis)   “Shippers”  
  • 43.
  • 44. Analyzing sleep patterns J *  Times  are  Israel  Day  Time  
  • 47. Alerting via anomaly detection
  • 48. To conclude… •  Fail  fast   – With  back-­‐off   •  Log  everything   •  Aler@ng  via  state  machines   •  By  host,  service,  @me   •  Alerts  are  useless  without  proper  debug  tools   •  Try  to  self-­‐heal  when  possible.  
  • 49. Thank  you   Ques>ons?   Itamar  Syn-­‐Hershko   h1p://code972.com   @synhershko  /  @ForterTech