The rise of cloud and containers has led to systems that are far more distributed and dynamic. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. This generates a continuous stream of infrastructure data.
On the business side, we have started storing large amounts of data, and this data carries enormous information; especially when married with infrastructure data, it gives a holistic picture of the health of the entire platform. We will talk about how to achieve this kind of fine-grained observability at scale, in real time.
Agenda
1. What is Observability
2. How it is different from Monitoring
3. Why do we need Observability
4. A typical observability pipeline
5. How observability enables intelligence
6. Comparison of tools
7. Questions
Monitoring
What data does our system have?
1. Facts
Things we are aware of and understand.
E.g., we are running Spark jobs on ephemeral clusters.
2. Hypotheses
Things we are aware of but don't understand.
E.g., the VMs were preempted, causing user requests to fail.
3. Assumptions
Things we understand but are not aware of.
E.g., the Dataflow job will be able to handle increasing load automatically.
4. Discoveries
Things we are neither aware of nor understand.
E.g., users leave a particular page of the website too early, and this is happening because the microservice pod serving that page restarts many times due to a JVM heap error and a memory-limit issue.
Observability
Typical Sources, Insights, and Roles

| Store | Data point | KPIs derived (e.g.) | Personas |
| --- | --- | --- | --- |
| DataDog | Infrastructure metrics | Deployment status; request/response time (Micrometer integration); downtime for a service/Kafka; load on a service / load-balancer information; users affected by infra issues | Developers; client technical team; product owners |
| Kafka | Application events | Number of users; orders placed; orders returned; revenue generated; price sensitivity | CXOs; data scientists; analysts; product owners |
| EFK | Application exceptions; DB exceptions; UI exceptions | Applications affected by system issues; causes of exceptions | Developers; client technical team; operations; security champions |
| Istio | Network rules / service mesh | Routing; service availability; service traceability | Developers; security team |
| GTM | Click stream | Customer behavior; UI issues faced by customers; device information | Developers; product owners; data scientists |
| Jenkins | Value stream | Path to production | Developers |
Streaming Technologies
| | Spark | Kafka (Kafka Streams) | Flink |
| --- | --- | --- | --- |
| Processing model | Micro-batch | One record at a time | One record at a time |
| Deployment | Own cluster; supports YARN, Mesos, or containers | Library that any Java application can embed | Own cluster; supports YARN, Mesos, or containers |
| Life cycle | Stream-processing code is deployed and run as a job in the Spark cluster | Stream-processing code runs inside the application itself | Stream-processing code is deployed and run as a job in the Flink cluster |
| Typically owned by | Data infrastructure or BI team | The application team that embeds it | Data infrastructure or BI team |
| Coordination | Yes | No | Yes |
| Source of continuous data | Kafka, file systems, other message queues | Strictly Kafka; data outside Kafka is a problem | Kafka, file systems, other message queues; bounded and unbounded data streams |
| File formats | Avro, Parquet, JSON, CSV, ORC | Text, SequenceFile, RCFile, ORC, Parquet | Avro, Parquet, JSON, CSV |
| Semantics | Exactly-once end-to-end with specific sources and sinks | Exactly-once end-to-end with Kafka | Exactly-once end-to-end with specific sources and sinks |
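To make the Spark column concrete, here is a minimal sketch of a Structured Streaming job that reads a continuous Kafka source and relies on checkpointing, a replayable source, and an idempotent file sink for its exactly-once guarantee. The broker address, topic name, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes the spark-sql-kafka connector is on the classpath and that a
# topic named "events" exists on the given (hypothetical) broker.
spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .select(col("value").cast("string").alias("payload")))

# Checkpointing + replayable source + idempotent file sink is what gives
# Structured Streaming its end-to-end exactly-once semantics.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/raw/events")
         .option("checkpointLocation", "/checkpoints/events")
         .start())
query.awaitTermination()
```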
Querying Tools
| | SparkSQL | Presto | Drill | Druid |
| --- | --- | --- | --- | --- |
| Can query petabytes of data | Yes | Yes | Yes | No |
| Used for | Complex math, statistics, ML-intensive tasks | BI queries | BI queries | BI and real-time analytics on event-driven data |
| Fault tolerance | Yes | No | Yes | Yes |
| In-memory processing | Yes | Yes | Yes | No |
| Processing speed | Slower than Presto and Drill | Faster than SparkSQL | Faster than SparkSQL | Faster than Spark, Presto, and Drill for specific types of queries |
| File formats | Avro, Parquet, JSON, CSV, ORC | Text, SequenceFile, RCFile, ORC, Parquet | Avro, Parquet, JSON, CSV | Avro, Parquet, JSON, CSV |
| Schema-free querying support | Yes | No | Yes | No |
| Supports ANSI SQL | Yes | Yes | Yes | Subset |
| JDBC/ODBC support | Yes | Yes | Yes | Yes |
| Performance benefits | Catalyst and Tungsten | Vectorized columnar processing | Columnar execution and vector processing | Columnar time-based segments, bitmap indexing |
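To make the "used for" row concrete, a minimal SparkSQL sketch (the dataset path and column names are hypothetical): ANSI SQL runs the aggregation, and the result flows straight into DataFrame/ML code, which is where SparkSQL differs from the pure query engines.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical curated Parquet dataset of order events.
spark.read.parquet("/data/curated/orders").createOrReplaceTempView("orders")

# ANSI SQL for the aggregation...
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# ...and the result is an ordinary DataFrame, ready for further
# statistics or MLlib pipelines.
daily.orderBy("order_date").show()
```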
Reporting Tools
| | Tableau | Looker | Apache Superset (incubating) | Pentaho | Metabase |
| --- | --- | --- | --- | --- | --- |
| Visualizations | Drag and drop, SQL | Drag and drop | Spark, SQL, basic drag and drop | Drag and drop | Drag and drop, SQL |
| Intuitiveness and usability | Intuitive, interactive, and easy to use | Easy to use | Intuitive, interactive, and easy to use | Comparatively less ease of use | Easy to use |
| Databases supported | Natively supports all well-known databases | Most well-known databases | Fewer databases (Druid and DBs supporting SQLAlchemy) | Fewer databases (JDBC-compliant DBs, MongoDB) | Most well-known databases |
| Security and access control | Kerberos, SSPI, SAML, OpenID, Active Directory, LDAP, local, etc. | Google OAuth, LDAP, SAML, OpenID | Flask AppBuilder (FAB) | Pentaho Security, LDAP, single sign-on, Active Directory, Kerberos | Google OAuth, LDAP |
| Self-service visualization | Yes | Yes | Yes | Yes | Yes |
| Pricing | $245 per month | $3,000–$5,000 per month | Free | Subscription-based pricing | Free |
| Data science and ML support | Predictive analysis | Advanced analytics | Predictive analysis | Predictive analysis | Analytical |
| Other | Better support for advanced analytics and the corresponding data visualisation | No support for OLAP | | Advanced analytics is not as mature as Tableau's | Easy setup and usability |
Hypothesis:
A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.
The same pattern is replicated for microservices.
Also serverless, or functions as a service.
Tight coupling to operations.
What happens when we want to change: how big a task is it to change tools, and how easy is it to experiment with new ones?
Decouple the data producers from the consumers (infrastructure from operational systems).
Move from a host-centric model to a service-centric model.
A pattern that can evolve while still delivering wins along the way.
Maps to serverless architecture.
Empowers teams in siloed organizations.
What the Raw layer is, its usage, and its advantages.
Building the Raw layer from streaming source data introduces challenges such as too many small files, which put a burden on HDFS and the Hadoop ecosystem and make the layer difficult to manage.
Small files at the Raw and other layers can be reduced using compaction jobs, as sketched below.
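A minimal sketch of such a compaction job in PySpark; the paths, partition, and target file count are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-compaction").getOrCreate()

# Read the many small files written by the streaming job for one partition...
df = spark.read.parquet("/data/raw/events/date=2020-01-01")

# ...and rewrite them as a handful of larger files. The target count is a
# tuning knob; aim for file sizes near the HDFS block size. Writing to a
# separate path avoids clobbering the input mid-read.
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("/data/raw_compacted/events/date=2020-01-01"))
```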
Delta Lake, open-sourced by Databricks, builds a table-like abstraction with ACID transactions, versioning, and time-travel support on top of the file system. It is easy to use: just change the file format. It supports HDFS and S3 as well.
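To show how small the change is, a sketch under the assumption that the delta-core package is on the Spark classpath (the paths are hypothetical):

```python
from pyspark.sql import SparkSession

# Assumes the io.delta:delta-core package is available to this Spark session.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()
df = spark.read.parquet("/data/raw/events")

# Adopting Delta Lake is essentially a format swap on the write...
df.write.format("delta").mode("overwrite").save("/data/delta/events")

# ...which buys ACID transactions, versioning, and time travel:
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/data/delta/events"))
```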
Lambda or Kappa: Lambda is a speed layer plus a batch layer. Kappa thinks only in terms of a streaming layer, with the claim that batch is a special case of streaming.
Lambda drew criticism because maintaining separate code bases for the batch and stream paths was difficult, largely because, at the time, different technologies were used for batch and streaming. Now, with the advent of Apache Beam, Spark, and Flink, you can write unified pipelines, and with clean-code practices you can keep them maintainable (see the sketch below). Using the Kappa architecture and trying to fit every problem into streams has its own consequences: you need to think about checkpointing and reprocessing partial data, and it also creates hurdles in upgrading your applications quickly. If you are starting fresh and have no prior streaming experience, choose Lambda over Kappa, and slowly adopt a mixed approach of both.
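A minimal sketch of such a unified pipeline in PySpark, where a single transformation function serves both the batch and the speed layer; the schema, topic, broker, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()

# Hypothetical order schema shared by both layers.
schema = (StructType()
          .add("status", StringType())
          .add("amount", DoubleType())
          .add("fx_rate", DoubleType()))

def enrich(orders: DataFrame) -> DataFrame:
    """Shared business logic: written once, reused by batch and stream."""
    return (orders.filter(col("status") == "completed")
                  .withColumn("amount_usd", col("amount") * col("fx_rate")))

# Batch layer: the function applied to historical files.
batch = enrich(spark.read.schema(schema).json("/data/raw/orders"))

# Speed layer: the very same function applied to a live Kafka topic.
speed = enrich(spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker1:9092")
               .option("subscribe", "orders")
               .load()
               .select(from_json(col("value").cast("string"), schema).alias("o"))
               .select("o.*"))
```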
Lambda architecture is an approach to big data management that provides access to batch processing and near real-time processing with a hybrid approach. The basic architecture of Lambda has three layers: Batch, speed and serving. The batch layer, which typically makes use of Hadoop, is the location where all the data is stored. MapReduce runs regular batch processing on the totality of this data. This information is sent to a data store and is used to gain insights into historic data trends.
Alongside this slower layer, new data is captured and processed as it comes in. The speed layer provides business users with the ability to adjust decision making and respond quickly to rapidly emerging trends. Data that passes into this real-time layer is also copied into the larger data set for slower, batch processing. Once the real-time processing is complete, the data is cleared from the speed layer to clear the way for more incoming data. The real-time layer can operate efficiently even with a steady stream of complex data because it only has to handle the volume of data that comes in between rounds of batch processing.
The speed and batch layers are merged for querying through the serving layer, which features a massively parallel processing query engine. Having access to this combined data set helps ensure that accurate reporting is available at all times with low latency.
Standard specs: as this platform will serve multiple teams, unifying the structure and format of events is important; one possible envelope is sketched below.
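A minimal sketch of what such a standard event envelope could look like; every field name here is an illustrative assumption, not a prescribed spec.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json
import uuid

@dataclass
class EventEnvelope:
    """Hypothetical standard envelope that every team wraps its payload in."""
    event_type: str   # e.g. "order.placed"
    source: str       # emitting service / bounded context
    payload: dict     # domain-specific body, owned by the producing team
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    schema_version: int = 1

# A producer emits a uniformly shaped event regardless of its domain.
envelope = EventEnvelope("order.placed", "checkout-service",
                         {"order_id": "A-1001", "amount": 42.5})
print(json.dumps(envelope.__dict__))
```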
Explain lambda architecture.
What are the possible tools/technologies in each layer?
A scalable, fault-tolerant stream to handle the petabytes of data generated.
A benefit of using data-visualization tools is that they enable business people, data scientists, and subject-matter experts to participate in the application development process. Check the skill level of your business users and their needs, as well as the amount and kind of data you usually process; keep these things in mind when choosing the right tool.
A last thought: build this incrementally, one step at a time; there is no need to add complexity from the beginning.
For a small, limited number of applications in an enterprise, traditional warehousing and reporting tools serve well.
While building incrementally, you can build for one small microservice/bounded context at a time.
Also, if you are building on existing on-prem infrastructure, you will need to coordinate with other teams for upfront capacity planning.
In the cloud, you can take advantage of isolation and serverless infrastructure to build it and get started.