BUILDING A MISSION CRITICAL EVENT SYSTEM ON TOP OF MONGODB
by @shahar_kedar
BIGPANDA
SaaS platform that lets companies aggregate alerts
from all their monitoring systems into one place for
faster incident discovery and response.
HOW IT WORKS
Events
• High CPU on prod-srv-1, 18/06/14 16:05, CRITICAL
• High CPU on prod-srv-1, 18/06/14 16:07, WARNING
• Memory usage on prod-srv-1, 18/06/14 16:08, CRITICAL
Entities
• High CPU on prod-srv-1: WARNING
• Memory usage on prod-srv-1: CRITICAL
Incidents
• 2 Alerts on prod-srv-1
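The flow on this slide can be sketched in a few lines of JavaScript. This is a minimal illustration, not BigPanda's actual pipeline: the field names (`check`, `host`), the rule that an entity carries the status of its latest event, and the rule that entities are grouped into incidents by host are all assumptions made for the example.

```javascript
// Hypothetical event shape; the slide only shows a description, a host, a time and a status.
const events = [
  { check: 'High CPU', host: 'prod-srv-1', timestamp: '2014-06-18T16:05:00Z', status: 'CRITICAL' },
  { check: 'High CPU', host: 'prod-srv-1', timestamp: '2014-06-18T16:07:00Z', status: 'WARNING' },
  { check: 'Memory usage', host: 'prod-srv-1', timestamp: '2014-06-18T16:08:00Z', status: 'CRITICAL' },
];

// Fold events into entities: one entity per (check, host) pair,
// carrying the status of the most recent event (an assumed rule).
function toEntities(events) {
  const entities = new Map();
  const ordered = [...events].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  for (const event of ordered) {
    const key = `${event.check}@${event.host}`;
    const entity = entities.get(key) || { check: event.check, host: event.host, events: [] };
    entity.status = event.status;
    entity.events.push(event);
    entities.set(key, entity);
  }
  return [...entities.values()];
}

// Group entities into incidents; grouping by host is an assumption.
function toIncidents(entities) {
  const incidents = new Map();
  for (const entity of entities) {
    const incident = incidents.get(entity.host) || { host: entity.host, entities: [] };
    incident.entities.push({ check: entity.check, status: entity.status });
    incidents.set(entity.host, incident);
  }
  return [...incidents.values()].map((incident) => ({
    ...incident,
    description: `${incident.entities.length} Alerts on ${incident.host}`,
  }));
}

const entities = toEntities(events);
const incidents = toIncidents(entities);
```

With the three sample events, the two High CPU events collapse into one entity whose status is WARNING, and both entities end up in a single incident for prod-srv-1.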
PRODUCT REQUIREMENTS
• Events need to be processed into incidents and
streamed to the user’s browser as fast as possible 	

• Incidents need to reliably reflect the state as it is in
the monitoring system	

• The service has to be up and running 24x7
MISSION CRITICAL
• It’s not rocket science, it’s not Google, but:	

• It has to be super fast	

• It has to be extremely reliable	

• It has to always be available
OUR #1 COMPETITOR
WHY MONGO?
BECAUSE IT’S WEB SCALE!
WHY MONGO?
At first:	

• NodeJS shop	

• Schemaless	

• Easy to master	

Later on:	

• Reliable	

• Easy to evolve	

• Partial and atomic updates	

• Powerful query language
BECAUSE IT’S WEB SCALE!
SUPER FAST
Hardware
Schema Design
Lean & Stream
HARDWARE
• 03/13: 3 x m1.medium
• 02/14: 1 x i2.xlarge + 2 x m1.medium
• 06/14: 2 x i2.xlarge + 1 x m3.xlarge
Instance types:
• m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
• m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
• i2.xlarge: 4 vCPUs, 30.5GB RAM, SSD 800GB
x3 reads
x4 writes
“Schema design is … the largest factor when it comes
to performance and scalability … more important
than hardware, how you shard, or anything else,
schema is by far the most important thing.”
–Eliot Horowitz
SCHEMA DESIGN
Event
{
  timestamp: Date,
  status: String,
  description: String
}
Entity
{
  start: Date,
  end: Date,
  status: String,
  description: String,
  events: [ <embedded Event> ],
  source_system: String
}
Incident
{
  start: Date,
  end: Date,
  is_active: Boolean,
  description: String,
  entities: [
    { entityId: ObjectId, status: String }
  ]
}
DENORMALIZATION
• Go over the checklist (http://bit.ly/1vUdz2T)	

• Incidents => Entities: partially embedded + ref	

• Cardinality: one-to-few	

• Direct access to Entities	

• Entities are frequently updated	

• Entities => Events: embedded	

• Events are not directly accessed	

• Events are immutable	

• Cardinality: one-to-many ~ one-to-gazillion
INDEXES
• Optimized indexes 

db.collection.find({..}).explain()	

• Removed redundant indexes	

• Truncated events collections (TTL index)
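One common kind of redundant index is a prefix of another compound index: a query that could use `{ status: 1 }` can usually be served by `{ status: 1, start: -1 }` as well. The sketch below flags such prefixes. The index definitions are hypothetical (shaped like the output of `db.collection.getIndexes()` in the mongo shell), and the check deliberately ignores index options such as unique, sparse or TTL, which can make a prefix index necessary after all.

```javascript
// Hypothetical index definitions; the deck does not list the real ones.
const indexes = [
  { name: 'status_1', key: { status: 1 } },
  { name: 'status_1_start_-1', key: { status: 1, start: -1 } },
  { name: 'source_system_1', key: { source_system: 1 } },
];

// True when every key of `candidate` matches the leading keys of `other`,
// in the same order and direction, and `other` has more keys.
function isPrefixOf(candidate, other) {
  const a = Object.entries(candidate.key);
  const b = Object.entries(other.key);
  if (a.length >= b.length) return false;
  return a.every(([field, direction], i) => b[i][0] === field && b[i][1] === direction);
}

// Names of indexes that are a strict prefix of some other index.
function findRedundantIndexes(indexes) {
  return indexes
    .filter((candidate) => indexes.some((other) => other !== candidate && isPrefixOf(candidate, other)))
    .map((index) => index.name);
}

const redundant = findRedundantIndexes(indexes);
```

Any index flagged this way is only a candidate for removal; confirming with explain() that the affected queries still use the wider index is the safer path.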
LEAN QUERIES
• Use projections to limit fields returned by a query:

Model.find().select('-events')

• Mongoose users: use .lean() when possible to gain a performance boost of more than 50%:

Model.find().lean()

• Stream results:

Model.find().stream().on('data', function(doc){})
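The first two tips combine naturally into one query. The helper below is a small sketch: `findWithoutEvents` is a hypothetical name, and `Model` stands for any Mongoose model whose documents have an `events` array, as in the Entity schema.

```javascript
// Hypothetical helper; `Model` is any Mongoose model passed in by the caller.
function findWithoutEvents(Model, filter) {
  return Model.find(filter)
    .select('-events') // projection: leave out the potentially large embedded events array
    .lean();           // return plain JavaScript objects instead of full Mongoose documents
}
```

Because `.lean()` skips document hydration, the results have no Mongoose methods such as `save()`, so this suits read-only paths like serving API responses.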

RESULTS
• Average latency of all API calls went from 500ms
to under 20ms	

• Average latency of full pipeline went from 2s to
under 500ms	

• Peak time latency of full pipeline went down from
5m(!!) to less than 30s
EXTREMELY RELIABLE
Atomic & Partial Updates
ATOMIC & PARTIAL UPDATES
• Several services might try to update the same
document at the same time, but:	

• Different systems update different parts of the
document	

• Updates to the same document are sharded and
ordered at the application level 

(read our awesome blog post: http://bit.ly/1nQVcbS)
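One way to read "sharded and ordered at the application level" is to route every update for a given document to the same queue, so that a single consumer applies them in arrival order. The sketch below shows that idea only; the hash function, the `ShardedUpdateQueue` class and the shard count are assumptions for illustration, and the linked blog post describes the real design.

```javascript
// Simple deterministic string hash (illustrative; any stable hash would do).
function hashString(value) {
  let hash = 0;
  for (const char of value) {
    hash = (hash * 31 + char.codePointAt(0)) >>> 0;
  }
  return hash;
}

class ShardedUpdateQueue {
  constructor(shardCount) {
    this.shards = Array.from({ length: shardCount }, () => []);
  }

  // The same document id always maps to the same shard.
  shardFor(documentId) {
    return hashString(documentId) % this.shards.length;
  }

  // Updates for one document are appended to its shard in arrival order.
  enqueue(documentId, update) {
    this.shards[this.shardFor(documentId)].push({ documentId, update });
  }

  // A single consumer per shard applies updates one by one, preserving order.
  drain(shardIndex, apply) {
    const queue = this.shards[shardIndex];
    while (queue.length > 0) {
      const { documentId, update } = queue.shift();
      apply(documentId, update);
    }
  }
}

const queue = new ShardedUpdateQueue(4);
queue.enqueue('entity-1', { $set: { status: 'CRITICAL' } });
queue.enqueue('entity-2', { $set: { status: 'WARNING' } });
queue.enqueue('entity-1', { $set: { status: 'OK' } });

const applied = [];
for (let shard = 0; shard < 4; shard += 1) {
  queue.drain(shard, (documentId, update) => applied.push([documentId, update.$set.status]));
}
```

Each update here is a partial `$set` on a single field, so services that own different fields do not overwrite each other, while the per-document ordering covers updates to the same field.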
IMPOSSIBLE TO KILL
Replica Set
Disaster Recovery
REPLICA SET
• 3-node replica set

• Using priorities to enforce master election of
stronger nodes	

• Deployed on different availability zones
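A replica set configuration along these lines might look like the object below. The hostnames, the set name and the priority values are hypothetical, since the deck does not give the real ones; an object of this shape is what `rs.initiate()` or `rs.reconfig()` accepts in the mongo shell.

```javascript
// Hypothetical hostnames, set name and priorities, one member per availability zone.
const replicaSetConfig = {
  _id: 'rs0',
  members: [
    { _id: 0, host: 'mongo-a.example.com:27017', priority: 2 }, // stronger node, zone A
    { _id: 1, host: 'mongo-b.example.com:27017', priority: 2 }, // stronger node, zone B
    { _id: 2, host: 'mongo-c.example.com:27017', priority: 1 }, // weaker node, zone C
  ],
};

// Members with the highest priority are the ones preferred in elections.
const topPriority = Math.max(...replicaSetConfig.members.map((member) => member.priority));
const preferredPrimaries = replicaSetConfig.members
  .filter((member) => member.priority === topPriority)
  .map((member) => member.host);
```

Giving the two stronger nodes a higher priority means the weaker node only becomes primary when neither of them is eligible.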
DISASTER RECOVERY
• Cold backup using MMS Backup	

• Full production replication on another EC2 region:
using mongo’s replication mechanism to
continuously sync data to the backup region
THANK YOU!



Editor's Notes

  1. For each customer:
     • aggregate alert notifications from multiple monitoring systems
     • group together alerts that belong to the same monitored appliance
     • group together, into “incidents”, alerts that are (topo-)logically related