The rise of cloud and containers has led to systems that are far more distributed and dynamic. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. This generates a continuous stream of infrastructure data.
On the business side, we have started storing large amounts of data, and this data carries enormous information; especially when married with infrastructure data, it gives a holistic picture of the health of the entire platform. We will talk about how to achieve this kind of fine-grained observability at scale, in real time.
Agenda
1. What is Observability
2. How it is different from Monitoring
3. Why do we need Observability
4. A typical observability pipeline
5. How observability enables intelligence
6. Comparison of tools
7. Questions
Monitoring
What data does our system have?
1. Facts
Things we are aware of and understand.
E.g., we are running Spark jobs on ephemeral clusters.
2. Hypotheses
Things we are aware of but don't understand.
E.g., the VMs were preempted, causing user requests to fail.
3. Assumptions
Things we understand but are not aware of.
E.g., the Dataflow job will be able to handle increasing load automatically.
4. Discoveries
Things we are neither aware of nor understand.
E.g., users leave a particular page of the website too early, and this is happening because the microservice pod serving that page restarts many times due to a JVM heap error and a memory-limit issue.
Observability
Typical Sources, Insights, and Roles

| Store | Data point | KPIs derived (e.g.) | Personas |
| --- | --- | --- | --- |
| DataDog | Infrastructure metrics | Deployment status; request/response time (Micrometer integration); downtime for a service/Kafka; load on a service / load-balancer information; users affected by infra issues | Developers; client technical team; product owners |
| Kafka | Application events | Number of users; orders placed; orders returned; revenue generated; price sensitivity | CXOs; data scientists; analysts; product owners |
| EFK | Application exceptions; DB exceptions; UI exceptions | Applications affected by system issues; causes of exceptions | Developers; client technical team; operations; security champions |
| Istio | Network rules / service mesh | Routing; service availability; service traceability | Developers; security team |
| GTM | Click stream | Customer behavior; UI issues faced by customers; device information | Developers; product owners; data scientists |
| Jenkins | Value stream | Path to production | Developers |
Streaming Technologies
| | Spark | Kafka (Kafka Streams) | Flink |
| --- | --- | --- | --- |
| Processing model | Micro-batch | One record at a time | One record at a time |
| Deployment | Own cluster; supports YARN, Mesos, or containers | Library that any Java application can embed | Own cluster; supports YARN, Mesos, or containers |
| Life cycle | Stream-processing code is deployed and run as a job in the Spark cluster | Stream-processing code runs inside the application itself | Stream-processing code is deployed and run as a job in the Flink cluster |
| Typically owned by | Data infrastructure or BI team | The application team that embeds it | Data infrastructure or BI team |
| Coordination | Yes | No | Yes |
| Source of continuous data | Kafka, file systems, other message queues | Strictly Kafka; data outside Kafka is a problem | Kafka, file systems, other message queues; bounded and unbounded data streams |
| File formats | Avro, Parquet, JSON, CSV, ORC | Text, SequenceFile, RCFile, ORC, Parquet | Avro, Parquet, JSON, CSV |
| Semantics | Exactly-once end-to-end with specific sources and sinks | Exactly-once end-to-end with Kafka | Exactly-once end-to-end with specific sources and sinks |
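To make the Spark column concrete, here is a minimal sketch of a Structured Streaming job that reads a continuous Kafka source and relies on checkpointing, a replayable source, and an idempotent file sink for its exactly-once guarantee. The broker address, topic name, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes the spark-sql-kafka connector is on the classpath and that a
# topic named "events" exists on the given (hypothetical) broker.
spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .select(col("value").cast("string").alias("payload")))

# Checkpointing + replayable source + idempotent file sink is what gives
# Structured Streaming its end-to-end exactly-once semantics.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/raw/events")
         .option("checkpointLocation", "/checkpoints/events")
         .start())
query.awaitTermination()
```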
Querying Tools
| | SparkSQL | Presto | Drill | Druid |
| --- | --- | --- | --- | --- |
| Can query petabytes of data | Yes | Yes | Yes | No |
| Used for | Complex math, statistics, ML-intensive tasks | BI queries | BI queries | BI and real-time analytics on event-driven data |
| Fault tolerance | Yes | No | Yes | Yes |
| In-memory processing | Yes | Yes | Yes | No |
| Processing speed | Slower than Presto and Drill | Faster than SparkSQL | Faster than SparkSQL | Faster than Spark, Presto, and Drill for specific types of queries |
| File formats | Avro, Parquet, JSON, CSV, ORC | Text, SequenceFile, RCFile, ORC, Parquet | Avro, Parquet, JSON, CSV | Avro, Parquet, JSON, CSV |
| Schema-free querying support | Yes | No | Yes | No |
| Supports ANSI SQL | Yes | Yes | Yes | Subset |
| JDBC/ODBC support | Yes | Yes | Yes | Yes |
| Performance benefits | Catalyst and Tungsten | Vectorized columnar processing | Columnar execution and vector processing | Columnar time-based segments, bitmap indexing |
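To make the "used for" row concrete, a minimal SparkSQL sketch (the dataset path and column names are hypothetical): ANSI SQL runs the aggregation, and the result flows straight into DataFrame/ML code, which is where SparkSQL differs from the pure query engines.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical curated Parquet dataset of order events.
spark.read.parquet("/data/curated/orders").createOrReplaceTempView("orders")

# ANSI SQL for the aggregation...
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# ...and the result is an ordinary DataFrame, ready for further
# statistics or MLlib pipelines.
daily.orderBy("order_date").show()
```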
Reporting Tools
| | Tableau | Looker | Apache Superset (incubating) | Pentaho | Metabase |
| --- | --- | --- | --- | --- | --- |
| Visualizations | Drag and drop, SQL | Drag and drop | Spark, SQL, basic drag and drop | Drag and drop | Drag and drop, SQL |
| Intuitiveness and usability | Intuitive, interactive, and easy to use | Easy to use | Intuitive, interactive, and easy to use | Comparatively less ease of use | Easy to use |
| Databases supported | Natively supports all well-known databases | Most well-known databases | Fewer databases (Druid and DBs supporting SQLAlchemy) | Fewer databases (JDBC-compliant DBs, MongoDB) | Most well-known databases |
| Security and access control | Kerberos, SSPI, SAML, OpenID, Active Directory, LDAP, local, etc. | Google OAuth, LDAP, SAML, OpenID | Flask AppBuilder (FAB) | Pentaho Security, LDAP, single sign-on, Active Directory, Kerberos | Google OAuth, LDAP |
| Self-service visualization | Yes | Yes | Yes | Yes | Yes |
| Pricing | $245 per month | $3,000–$5,000 per month | Free | Subscription-based pricing | Free |
| Data science and ML support | Predictive analysis | Advanced analytics | Predictive analysis | Predictive analysis | Analytical |
| Other | Better support for advanced analytics and the corresponding data visualisation | No support for OLAP | | Advanced analytics is not as mature as Tableau's | Easy setup and usability |
Hypothesis:
A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.
The same pattern is replicated for microservices.
Also serverless, or functions as a service.
Tight coupling to operations.
What happens when we want to change: how big a task is it to change tools, and how easy is it to experiment with new ones?
Decouple the data producers from the consumers (infrastructure from operational systems).
Move from a host-centric model to a service-centric model.
A pattern that can evolve while still delivering wins along the way.
Maps to serverless architecture.
Empowers teams in siloed organizations.
What the Raw layer is, its usage, and its advantages.
Building the Raw layer from streaming source data introduces challenges such as too many small files, which put a burden on HDFS and the Hadoop ecosystem and make the layer difficult to manage.
Small files at the Raw and other layers can be reduced using compaction jobs, as sketched below.
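A minimal sketch of such a compaction job in PySpark; the paths, partition, and target file count are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-compaction").getOrCreate()

# Read the many small files written by the streaming job for one partition...
df = spark.read.parquet("/data/raw/events/date=2020-01-01")

# ...and rewrite them as a handful of larger files. The target count is a
# tuning knob; aim for file sizes near the HDFS block size. Writing to a
# separate path avoids clobbering the input mid-read.
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("/data/raw_compacted/events/date=2020-01-01"))
```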
Delta Lake, open-sourced by Databricks, builds a table-like abstraction with ACID transactions, versioning, and time-travel support on top of the file system. It is easy to use: just change the file format. It supports HDFS and S3 as well.
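To show how small the change is, a sketch under the assumption that the delta-core package is on the Spark classpath (the paths are hypothetical):

```python
from pyspark.sql import SparkSession

# Assumes the io.delta:delta-core package is available to this Spark session.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()
df = spark.read.parquet("/data/raw/events")

# Adopting Delta Lake is essentially a format swap on the write...
df.write.format("delta").mode("overwrite").save("/data/delta/events")

# ...which buys ACID transactions, versioning, and time travel:
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/data/delta/events"))
```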
Lambda or Kappa: Lambda is a speed layer plus a batch layer. Kappa thinks only in terms of a streaming layer, with the claim that batch is a special case of streaming.
Lambda drew criticism because maintaining separate code bases for the batch and stream paths was difficult, largely because, at the time, different technologies were used for batch and streaming. Now, with the advent of Apache Beam, Spark, and Flink, you can write unified pipelines, and with clean-code practices you can keep them maintainable (see the sketch below). Using the Kappa architecture and trying to fit every problem into streams has its own consequences: you need to think about checkpointing and reprocessing partial data, and it also creates hurdles in upgrading your applications quickly. If you are starting fresh and have no prior streaming experience, choose Lambda over Kappa, and slowly adopt a mixed approach of both.
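A minimal sketch of such a unified pipeline in PySpark, where a single transformation function serves both the batch and the speed layer; the schema, topic, broker, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()

# Hypothetical order schema shared by both layers.
schema = (StructType()
          .add("status", StringType())
          .add("amount", DoubleType())
          .add("fx_rate", DoubleType()))

def enrich(orders: DataFrame) -> DataFrame:
    """Shared business logic: written once, reused by batch and stream."""
    return (orders.filter(col("status") == "completed")
                  .withColumn("amount_usd", col("amount") * col("fx_rate")))

# Batch layer: the function applied to historical files.
batch = enrich(spark.read.schema(schema).json("/data/raw/orders"))

# Speed layer: the very same function applied to a live Kafka topic.
speed = enrich(spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker1:9092")
               .option("subscribe", "orders")
               .load()
               .select(from_json(col("value").cast("string"), schema).alias("o"))
               .select("o.*"))
```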
Lambda architecture is an approach to big data management that provides access to batch processing and near real-time processing with a hybrid approach. The basic architecture of Lambda has three layers: Batch, speed and serving. The batch layer, which typically makes use of Hadoop, is the location where all the data is stored. MapReduce runs regular batch processing on the totality of this data. This information is sent to a data store and is used to gain insights into historic data trends.
Alongside this slower layer, new data is captured and processed as it comes in. The speed layer provides business users with the ability to adjust decision making and respond quickly to rapidly emerging trends. Data that passes into this real-time layer is also copied into the larger data set for slower, batch processing. Once the real-time processing is complete, the data is cleared from the speed layer to clear the way for more incoming data. The real-time layer can operate efficiently even with a steady stream of complex data because it only has to handle the volume of data that comes in between rounds of batch processing.
The speed and batch layers are merged for querying through the serving layer, which features a massively parallel processing query engine. Having access to this combined data set helps ensure that accurate reporting is available at all times with low latency.
Standard specs: as this platform will serve multiple teams, unifying the structure and format of events is important; one possible envelope is sketched below.
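A minimal sketch of what such a standard event envelope could look like; every field name here is an illustrative assumption, not a prescribed spec.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json
import uuid

@dataclass
class EventEnvelope:
    """Hypothetical standard envelope that every team wraps its payload in."""
    event_type: str   # e.g. "order.placed"
    source: str       # emitting service / bounded context
    payload: dict     # domain-specific body, owned by the producing team
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    schema_version: int = 1

# A producer emits a uniformly shaped event regardless of its domain.
envelope = EventEnvelope("order.placed", "checkout-service",
                         {"order_id": "A-1001", "amount": 42.5})
print(json.dumps(envelope.__dict__))
```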
Explain lambda architecture.
What are the possible tools/technologies in each layer?
A scalable, fault-tolerant stream to handle the petabytes of data generated.
A benefit of using data-visualization tools is that they enable business people, data scientists, and subject-matter experts to participate in the application development process. Check the skill level of your business users and their needs, as well as the amount and kind of data you usually process; keep these things in mind when choosing the right tool.
A last thought: build this incrementally, one step at a time; there is no need to add complexity from the beginning.
For a small, limited number of applications in an enterprise, traditional warehousing and reporting tools serve well.
While building incrementally, you can build for one small microservice/bounded context at a time.
Also, if you are building on existing on-prem infrastructure, you will need to coordinate with other teams for upfront capacity planning.
In the cloud, you can take advantage of isolation and serverless infrastructure to build it and get started.