How FinTech Innovator Razorpay Uses Open-Source Tracing And Observability to Manage Fast-Changing API Ecosystems

Transcript of a discussion on an open-source project, Hypertrace, and how it helps designers, builders,
and testers of modern APIs gain visibility across their internal and third-party services.
Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Traceable AI.
Dana Gardner: Hi, this is Dana Gardner, Principal Analyst at Interarbor Solutions, and you’re
listening to BriefingsDirect.
The speed and complexity of microservices-intense applications often leave their developers in
the dark. They too often struggle to track and visualize the actual underlying architecture of
these distributed services.
The designers, builders, and testers of modern API-driven apps, therefore, need an ongoing
and instant visibility capability into the rapidly changing data flows, integration points, and
assemblages of internal and third-party services.
Thankfully, an open-source project to advance the sophisticated distributed tracing and
observability platform called Hypertrace is helping.
Stay with us now as we hear about the evolution and capabilities of Hypertrace and how an
early adopter in the online payment suite business, Razorpay, has gained new insights and
deeper understanding of their services components.
To learn how Hypertrace discovers, monitors, visualizes,
and optimizes increasingly complex services architectures,
please welcome Venkat Vaidhyanathan, Architect at
Razorpay in Bangalore, India. Welcome, Venkat.
Venkat Vaidhyanathan: Thank you, Dana, for the warm
welcome.
Gardner: We’re also here with Jayesh Ahire, Founding
Engineer at Traceable AI and Product Manager for
Hypertrace. Welcome, Jayesh.
Jayesh Ahire: Thanks, Dana. Glad to be here.
Gardner: Venkat, what does Razorpay do and why is
tracing and understanding your services architecture so important?
Built by developers, for developers
Venkat: Razorpay’s mission is to enable frictionless banking and payment experiences by
powering the entire financial infrastructure for businesses of all shapes and sizes. It’s a full-
stack financial solution that enables thousands of small- to medium-sized enterprises (SMEs)
and enterprises to accept, process, and disburse payments at scale.
Today, we process billions of dollars of payments from millions of businesses across India. As a
leading payments provider, we have been the first to bring to market most of the major online
innovations in payments for the last five years.
For the last two years, we have successfully curated neo banking and lending services. We
have seen outstanding growth in the last five years and attracted close to $300 million in
funding from investors such as Sequoia, Tiger Global, Ribbit Capital, Matrix Partners, and others.
One of the fundamental principles about designing Razorpay
has been to build a largely API-driven ecosystem. We are a
developer-first company. Our general principle of building is,
“It is built by developers for developers,” which means that
every single product we build is always going to be API-driven
first. In that regard, we must ensure that our APIs are resilient.
It is extremely important to us that they perform at optimum capacity.
Gardner: What is it about being an API-driven organization that makes tracing and observability
such an important undertaking?
Venkat: We are an extremely Agile organization. As a startup, we have an obsession around
our customers. A focus on building quality products is paramount to creating the best user
experience (UX).
Our customers have amazing stories around our projects, products, and ecosystem. We have
worked through extreme times (for example, demonetization, and the Yes Bank outage), quickly
taking up the challenge and turning the tables for most of our customers. That has helped them
build a lot of trust in what we do -- and what we can do.
After all, we are dealing with one of the most sensitive aspects of human lives, which is their
money. So, in this regard, the resiliency, security, and all the usability parameters are
extremely important for our success.
Gardner: Jayesh, why is Razorpay a good example of what businesses are facing when it
comes to APIs? And what requirements for such users are you attempting to satisfy with your
distributed tracing and observability platform?
Observability offers scale, insight, and resilience
Ahire: Going back to the days when it all started, people
began building applications using monoliths. And it was
easier then to begin with monolithic applications to get the
business moving.
But in recent times, that is not the only important thing for
businesses. As we heard, Venkat needs scale and
resiliency in the platform while building with APIs. Most
modern organizations use microservices, which
complicates these modern architectures. They become
hard to manage, especially at large-scale organizations
where you can have 100 to 300 microservices, with
thousands of APIs communicating between those
microservices.
It’s just hard now for businesses to have the visibility and observability to determine if they have any
issues and to see if the APIs are performing as expected.
I use a list of four brief questions that every organization needs to answer at some point. Are
their APIs:
● Providing the functionality they are supposed to deliver?
● Performing in the way they are supposed to?
● Secure for their business users?
● Understood across all their APIs and microservices uses?
They must understand if the APIs and microservices are performing up to the actual
expectations and required functionality. They need something that can provide the answers to
these questions, at the very least.
Observability helps answer these essential questions without having to open the black box and
go to each service and every API. Instead, the instrumentation data provides those insights.
You can ask questions of your system and it will give you the answers. You can ask, for
example, how your system is performing -- and it will give you some answers. Such
observability helps large-scale organizations keep up with the scale and with the increasing
number of users. And that keeps the systems resilient.
Gardner: Venkat, what are your business imperatives for using Hypertrace? Is it for UX? What
is the business case for gaining more observability in your services development?
Metrics, logs, and traces together control trouble
Venkat: There are three fundamental legs to what we define as modern observability. One part
is with respect to metrics, the next part has to do with the logs, and the third part is in respect to
the traces.
Up until recently, we had application performance monitoring (APM) systems that monitored
some of these things, with a single place to gather some metrics and insights. However, as
microservices grew wider in use, APMs are no longer necessarily the right way to do these
things. For such metrics, a lot of work is already going on in the open-source ecosystem with
respect to Prometheus and others. I wrote a blog about our journey into scaling our metrics
platform to trillions of data points.
Once you can get logs -- whether it is from open-source ELK Stack [Elasticsearch, Logstash,
and Kibana], or whether it is from a lot of platform as a service (PaaS) and software as a service
(SaaS) log providers -- fundamentally the issue comes down to traces.
Now, traces can be visualized in a very primitive way, such as for instrumenting a particular
piece of code to understand its behavior. It could be for a timing function, for example.
However, as microservices evolve, you’re talking about a lot more problems, such as how much
time would a network call take? How much time would the database call take? Was my DNS
request the biggest impediment? What really happened in the last mile?
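As a rough illustration of the "primitive" form of tracing described here, the following Python sketch times nested operations the way a span would. It is not Hypertrace or OpenTelemetry code: the `span` helper, the `SPANS` list, the operation names, and the sleep durations are all hypothetical, chosen only to show how timings for a DNS lookup, a network call, and a database call could be recorded under one parent request and compared.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store of finished spans (illustration only).
SPANS = []

@contextmanager
def span(name, parent=None):
    """Record how long the wrapped block takes, with an optional parent name."""
    start = time.perf_counter()
    try:
        yield name
    finally:
        SPANS.append({
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000.0,
        })

def handle_request():
    # Each sleep stands in for real work; the durations are made up.
    with span("payment-create") as root:
        with span("dns-lookup", parent=root):
            time.sleep(0.001)
        with span("network-call", parent=root):
            time.sleep(0.002)
        with span("db-query", parent=root):
            time.sleep(0.003)

def slowest_child(spans, parent):
    """Return the name of the slowest span directly under the given parent."""
    children = [s for s in spans if s["parent"] == parent]
    return max(children, key=lambda s: s["duration_ms"])["name"]

handle_request()
print(slowest_child(SPANS, "payment-create"))
```

A real tracer also carries trace and span identifiers across process boundaries, which is what lets the same question be asked of an entire graph of services rather than one function.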
And when you’re talking about an entire graph of services, it’s very important to know what
particular point in the entire graph breaks down often – or doesn’t break down very often.
Understanding all these things, as Jayesh said, and asking the right questions cannot happen
only by using metrics or just logs. They only give different slices of the problems. And it cannot
happen only by using tracing, which also only gives a different slice of the problem.
In an ideal, nirvana world, you need to combine all these things and create a single place that
can correlate these various things and allow a deep dive with respect to a specific component,
module, function, system, query, or whatever. Being able to identify root causes and to reduce the mean
time to detect (MTTD) are some of the most important things that we need to
worry about.
In complex, large-scale systems, things go
wrong. Why things went wrong is one part,
when did things go wrong is another part,
and being able to arrive and fix things – the
MTTD and the mean time to recovery
(MTTR) -- those largely define the success
of any business.
We are just one of the many financial ecosystem providers. There are tons of providers in the
world. So, the customer has many options to switch from one provider to another. For any
business, how they react to these performance issues is the most important.
Observability tools like Hypertrace put us in control, rather than leaving us to hypothesize.
Gardner: Jayesh, how does Hypertrace improve on such key performance controls as MTTD
and MTTR? How is Hypertrace being used to cut down on that all important time to remediation
that makes the user experience more competitive?
Tracing adds ease to uncovering the unknown
Ahire: As Venkat pointed out, in these modern systems, there are too many unknown
unknowns. Finding out what caused any problem at any point in time is hard.
At Hypertrace, in trying to help businesses, we present entity-focused, API-first views.
Hypertrace provides a very detailed, out-of-the-box service dashboard and overview. Such a
backend and API overview helps find which services are talking to each other, how they are
talking to each other, and which APIs are talking to those services. It also provides a list of APIs.
Hypertrace provides a single-pane view into the services and API trace data. The insights
gained from the trace data make it easier to find which API or service has an issue. That’s
where the entity-first API view makes the most sense. The API dashboard helps people get to
the issue very easily and helps reduce the MTTD and MTTR.
Venkat: Just to add to what Jayesh mentioned, our ecosystem is internally a
Kubernetes ecosystem. And Kubernetes is extremely dynamic in nature. You’re no longer
dealing with single private IPs or public IPs, or any of those things. Services can come up.
Pods can come up. Deployments can come up and go down.
So, service discoverability becomes a problem,
which means that tying back a particular
behavior to these services, which are
themselves a collection of services, and to the
underlying infrastructure -- whether you’re
talking about queues or network calls -- you’re
talking about any number of interconnected
infrastructure components as well. That
becomes extremely challenging.
The second aspect is that, implicitly, most of our ecosystems run on preemptible workloads, or spot
workloads. So, nodes can come up, nodes can go down. How do you put these things together?
While we can identify a particular service as problematic, I want to find out if it is the service that
is problematic or the underlying cloud provider. And within the cloud provider, is it the network or
the actual hardware or operating system (OS)? If it is OS, which part precisely? Is it just a
particular part that is problematic, or is the entire hardware problematic? That’s one view.
The other view is that cardinality becomes an extremely important issue. Metrics alone cannot
solve that problem. Logs alone cannot solve that problem. A very simple request, for example, a
payment-create-request in our world, carries at least 30 to 35 different cardinality dimensions
(e.g.: the merchant identity, gateway, terminal, network, and whether the payment is domestic
vs international, etc.).
A variety of these parameters comes into play. You need to know whether the issue is overall or
at a particular merchant, and along which dimension. So, you need to narrow down the problem in a
tight production scenario.
To manage those aspects, tools like Hypertrace, or any observability tool, for that matter --
tracing in general -- make it a lot easier to arrive at the right conclusions.
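The narrowing-down Venkat describes can be pictured as grouping trace data by each dimension and seeing where errors concentrate. The Python sketch below is only an illustration under stated assumptions: the span records, the attribute names (`merchant`, `gateway`, `international`), and the "largest spread in error rate" heuristic are all hypothetical and are not taken from Razorpay's data or from Hypertrace.

```python
from collections import Counter

# Hypothetical span records; attribute names and values are invented.
spans = [
    {"merchant": "m1", "gateway": "g1", "international": False, "error": False},
    {"merchant": "m1", "gateway": "g2", "international": False, "error": True},
    {"merchant": "m2", "gateway": "g2", "international": True, "error": True},
    {"merchant": "m3", "gateway": "g1", "international": False, "error": False},
    {"merchant": "m3", "gateway": "g2", "international": False, "error": True},
    {"merchant": "m2", "gateway": "g1", "international": True, "error": False},
]

def error_rate_by(spans, dimension):
    """Return {value: error rate} for one attribute of the spans."""
    totals, errors = Counter(), Counter()
    for s in spans:
        totals[s[dimension]] += 1
        if s["error"]:
            errors[s[dimension]] += 1
    return {value: errors[value] / totals[value] for value in totals}

def most_suspect_dimension(spans, dimensions):
    """Pick the dimension whose values differ most in error rate."""
    def spread(dim):
        rates = error_rate_by(spans, dim).values()
        return max(rates) - min(rates)
    return max(dimensions, key=spread)

print(error_rate_by(spans, "gateway"))
print(most_suspect_dimension(spans, ["merchant", "gateway", "international"]))
```

In this made-up sample every failure goes through gateway `g2` while merchants and payment types fail at the same rate, so the gateway dimension stands out. With 30 to 35 real dimensions the same grouping has to run over far more combinations, which is why pre-aggregated metrics alone struggle with it.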
Gardner: You mentioned there are other options for tracing. How did you at Razorpay come to
settle on Hypertrace? What’s the story behind your adoption of Hypertrace after looking at the
tracing options landscape?
The why and how of Razorpay choosing Hypertrace
Venkat: When we began our observability journey, we realized we had to go further into
tracing because the APM tools were not answering a lot of the questions we were asking of
them. The best open-source version was that offered by Jaeger. We evaluated a lot of
PaaS/SaaS solutions. We really didn't want to build an in-house observability stack.
There were a few challenges in all the PaaS offerings including storage, ability to drill down,
retention, and cost versus value offered. Additionally, many of the providers were just giving us
Jaeger with add-ons. The overall cost-to-benefit ratio suffered because we were growing with
both the number of services and users. Any model that charges us on the user level, data
storage level, or services level -- these become prohibitive over time.
Although maintaining an in-house observability tool is not the most natural business direction for
us, we soon realized that maybe it’s best for us to do it in-house. We were doing some research
and hit upon this solution called Hypertrace. It looked interesting so we decided to give it a try.
They offered the ability for me to jump into a Slack channel. And that’s all I did. I just signed up. In
fact, I didn’t even sign up with my company email address. I signed up with my personal email
address and I just jumped on to their Slack channel.
I started asking the Hypertrace team lots of questions.
I started with Docker Compose, straight out of their GitHub
repo. The integration was quite straightforward. We did a
set of proof-of-concepts and said, “Okay, this sort of makes
sense.” The UX was on par with any commercial SaaS
provider. That blew my mind. How can an open-source
product build such a fantastic user interface (UI)? I think
that was the first thing that hit most of our heads. And I think that was the biggest sell. We said,
“Let’s just jump in and see how it evaluates.” And that’s the story.
Gardner: What sort of paybacks or metrics of success have you enjoyed since adopting
Hypertrace? As open source, are you injecting your own requirements or desired functions and
features into it?
Venkat: First and foremost, we wanted to understand the beast we were dealing with in our
APIs, which meant we had to build in the instrumentation and software development kits
(SDKs), including OpenCensus, OpenTracing, and OpenTelemetry agents.
The next step was integrating these tools within our services and ecosystem. There are
challenges in terms of internally standardizing all our instrumentation, using best practices, and
ensuring that applications adopt them. We had to make internal developer adoption easier by
building the right toolkits, the right frameworks, and the right SDKs because applications have
their own business asks, and you shouldn’t be adding woes to their existing development life
cycle. Integration should be simple! So, we formulated a virtual team internally within Razorpay
to build the observability stack.
As we built the SDKs and tooling and started instrumenting, we did a lot of adoption exercises
within the organization. Now, we have more than 15 critical services and a lot more in the
pipeline. Over a period of time, we were able to make tracing a habit rather than just another
“nice to have.”
One of the biggest benefits we started seeing from the production monitoring is that our internal
engineering teams figured out how to run performance tests in pre-production. Pinning down the
right problem areas this way wouldn’t have been possible before.
Now, during the performance testing, our
engineers can early-on pinpoint the root cause
of the problems. And they’ve gone back to fix
their code even before the code goes into
production. And believe me, that is a lot more
valuable for us than the code going into
production and then facing these problems.
The misfortune of all monitoring tools is that typical success metrics might not be applicable. Why?
Because when things go right, nobody wants to look at monitoring. It’s only when things go
wrong that people log into a monitoring tool.
The benefits of Hypertrace come in terms of how many issues you’re able to detect much earlier
in the stages of development. That’s probably the biggest benefit we have gotten.
Gardner: Jayesh, what makes Hypertrace unique in the tracing market?
Democratic data collection delivers API analytics
Ahire: There are two different ways to analyze, visualize, and use the data to better
understand the systems. The first important thing is how we do data collection. Hypertrace
provides data collection from any standard instrumentation.
If your application is instrumented with Jaeger, Zipkin, or OpenTelemetry, and you start sending
the instrumentation data to Hypertrace, it will be able to analyze it and show you the dashboard.
You then will be able to slice and dice the data using our explorer. You can discover a lot of
different things.
That democratization of the data collection aspect is one important thing Hypertrace provides.
And if you want to use any other tracing platform you can do that with Hypertrace because we
support all the standard instrumentation.
Next is how we utilize that data. Most tracing platforms provide a way to slice and dice their
data. So that’s just one explorer view where there’s all the data from the instrumentation
available and you can find the information you want. Ask the question and then you will get the
information. That’s one way to look at it.
Hypertrace provides, in addition to that
explorer view, a detailed service graph. With it,
you can go to applications, see the service
interactions, the latency markings, and learn
which services are having errors right away.
Out-of-the-box services derived from
instrumentation data provide many necessary
metrics and visualizations, including latency,
error rate, and call rate.
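To show how metrics such as latency, error rate, and call rate can be derived purely from instrumentation data, here is a minimal Python sketch. It does not reflect how Hypertrace computes its dashboards: the span shape (`duration_ms`, `error`), the `service_metrics` function, and the nearest-rank percentile method are assumptions made for illustration.

```python
import math

def service_metrics(spans, window_seconds):
    """Summarize spans for one service over a time window.

    Each span is a dict with 'duration_ms' and 'error' keys (assumed shape).
    """
    if not spans:
        return {"call_rate": 0.0, "error_rate": 0.0, "p50_ms": None, "p99_ms": None}
    durations = sorted(s["duration_ms"] for s in spans)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of the
        # data at or below it.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    errors = sum(1 for s in spans if s["error"])
    return {
        "call_rate": len(spans) / window_seconds,  # calls per second
        "error_rate": errors / len(spans),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }

# Made-up sample: five calls in a 10-second window, one slow and failing.
sample = [{"duration_ms": d, "error": d > 400} for d in (20, 30, 40, 50, 500)]
print(service_metrics(sample, window_seconds=10))
```

Because each span already carries a duration and a status, no separate metrics instrumentation is needed for these three numbers. That is the sense in which they come "out of the box."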
You can see more of the API interactions. You can also compare historical data to current data, for
example, your latency over the last day against the last hour. And it’s pretty helpful to be able to
compare between deployments, to see whether the performance, latency, or error rate is affected.
There are a lot of use cases you can solve with Hypertrace.
With such observability, you can detect problems early and reduce MTTD and MTTR using
these service dashboards.
Then there’s availability. The expectation is for availability of 99.99 percent. In the case of
Razorpay, it’s very critical. Any downtime has a business impact. For most businesses, that’s
the case. So, availability is a critical issue.
The Hypertrace dashboards help you to maintain that as well. Currently, we are working on
alerting features on deviations -- and those deviations are calculated automatically. We
calculate baselines from the previous data, and whenever a deviation happens, we give an
alert. That obviously helps in reducing MTTD as well as increasing availability generally.
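Jayesh does not say how the baselines are calculated, so the following Python sketch shows just one common way to do it: treat the mean of recent values as the baseline and flag anything more than a few standard deviations away. The `is_deviation` function, the threshold of 3.0, and the sample latencies are hypothetical and should not be read as Hypertrace's alerting logic.

```python
from statistics import mean, stdev

def is_deviation(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > threshold

# Made-up latency samples in milliseconds.
latency_history_ms = [100, 102, 98, 101, 99, 100]
print(is_deviation(latency_history_ms, 101))
print(is_deviation(latency_history_ms, 180))
```

The appeal of a calculated baseline is that nobody has to choose a fixed latency threshold per API. The trade-off is that a simple mean-and-deviation rule like this one ignores daily or weekly traffic patterns, which a production system would likely need to account for.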
Hypertrace strives to make the UX seamless. As Venkat mentioned, we have a beautiful UI that
looks professional and attractive. The UI work we put into our SaaS security solution, Traceable
AI, also goes into Hypertrace, and so helps the community. It helps people
such as Venkat at Razorpay to solve the problems in their environment. That’s pretty good.
Gardner: Venkat, for other organizations facing similar complexity and a need to speed
remediation, what recommendations do you have? What should other companies be thinking
about as they evaluate observability and tracing choices? What do you recommend they do as
they get more involved with API resiliency?
Experiment, evaluate, and then invest in your journey
Venkat: A fundamental problem today in the open-source world with tracing is the quality of
standards. We have OpenCensus on one side and OpenTracing on the other, both moving
to OpenTelemetry. In trying to keep it all compatible, and because it’s all so nascent, there is not
a lot of automation.
For most startups, it is quite daunting to build their own observability stack.
My recommendation is to start with an existing tracing
provider and evaluate that against your past
solutions. Over time it may become cost prohibitive.
At some point, you must start looking inward. That’s
the time when systems like Hypertrace become quite
useful for an organization.
The truth is it’s not easy to build an observability stack. So, experiment with a SaaS provider
on a lower scale. Then invest in the right tooling, one that gives the liberty to not maintain the
stack, such as Hypertrace. Keep the internal tooling separate, experiment, and come back.
That’s what I would recommend.
The cost is not just the physical infrastructure cost, or the licensing cost. Cost is also
the engineering cost of maintaining the stack. If the stack goes down, who monitors the monitor? It’s a big
question. So, there are trade-offs. There is no right answer, but it’s a journey.
After our experience with Hypertrace, I have connected with a couple of my friends in different
organizations, and I’ve told them of the benefits. I do not know their results, but I’ve told them
some of the benefits that we have leveraged using Hypertrace.
Gardner: And just to follow up on your advice for others, Venkat, what is it about open source
that helps with those trade-offs?
Venkat: One advantage we have with open-source is there is no vendor lock-in. That’s one
major advantage. One of our critical services is in PHP, so we could only use
OpenCensus for instrumenting it.
But there were a lot of performance and resilience issues with this codebase. Today, the original
OpenCensus PHP implementation points to Razorpay’s fork.
And we are working with the Hypertrace community, too, to build some features, whether it is in
tool design, Blue Coat, knowledge sharing, and bug-fixing. For us it’s been an interesting and
exciting journey.
Ahire: Yes, that has been the mutual experience from our end as well. We learned a lot of
things. We had made assumptions in the beginning about what users might expect or want.
But Razorpay worked with us. On some things they said, “Okay, this is not going to work. You
have to change this part.” And we modified some things, we added a few features, and we
removed a few things. That’s how it came to where it is today. The whole collaboration aspect
has been very rewarding.
Venkat: Even though we have only a handful of critical services, the data instrumented from
them was over two terabytes a day. And while that is a good problem to have, we have other
interesting scaling challenges we need to deal with.
So how do you optimize these things at scale? With a SaaS provider, we could only have gone and
said, “Hey, this sort of doesn’t work.” We stick with them for a few months, then we go ahead
with another SaaS provider and say, “Are you going to solve this problem or not?” Because, of
course, they’re not under our control, right?
The flexibility we get with open source is to say, “Okay, here’s the problem. How do we fix it?”
I think that’s super powerful.
Ahire: Here we all learn together.
Gardner: Yes, it certainly sounds like a partnership relationship. Jayesh, tell us a little bit about
the roadmap for Hypertrace, and particularly for the smaller organizations who might prefer a
SaaS model, what do you have in store for them?
Ahire: We are currently working on alerting. We’ll soon release dynamic anomaly-based
alerting.
We are also working on metric ingestion and integrations throughout the Hypertrace platform.
An important aspect of tracing and observability is being able to correlate the data. To
propagate context throughout the system is very important. That’s what we will be doing with
our metric integration. You will be able to send application metrics, and you will be able to
correlate them back to trace data and log data.
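Context propagation is what makes that correlation possible: every hop carries the same trace identifier, so traces, logs, and metrics can be joined on it. The Python sketch below assumes the W3C Trace Context `traceparent` header layout (version, trace id, parent span id, flags). Whether Hypertrace's metric integration uses this format is not stated in the discussion, and the helper names `new_traceparent`, `child_traceparent`, and `log_line` are hypothetical.

```python
import re
import secrets

# Layout assumed from W3C Trace Context: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_traceparent(incoming):
    """Keep the incoming trace id and mint a new span id for the next hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed or missing header: start fresh
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

def log_line(traceparent, message):
    """Prefix a log message with the trace id so logs can be joined to traces."""
    trace_id = traceparent.split("-")[1]
    return f"trace_id={trace_id} {message}"

inbound = new_traceparent()
outbound = child_traceparent(inbound)
print(log_line(outbound, "calling payment gateway"))
```

A service would read the header from the inbound request, attach the child value to each outbound call, and stamp the trace id on its log lines and metric labels. A full implementation would also reject all-zero identifiers, which this sketch does not check.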
And talking of SaaS, when it comes to smaller organizations with maybe 10, 20, or 30
developers and a not very well-defined DevOps team, it can be hard to deploy and manage this
kind of platform.
So, for those users, we are working toward a
SaaS model so smaller companies will be able to
use the Hypertrace stack functionality.
Gardner: Where can organizations go to learn more about Hypertrace and start to use some of
these features and functions?
Ahire: You can head over to hypertrace.org, our website, and find the details of our use cases.
There’s a Slack channel link, GitHub, and everything is available there. Those are good places
to start.
Venkat: Just try it first. Go to GitHub, and within a few minutes you should have the
entire stack up and running. I mean, that’s as simple as simplicity can get.
For further details, just go to the Slack channel and start communicating. Their team is super-
duper responsive and super-duper helpful. In fact, we have never had to talk to them saying,
“Hey, what’s this?” because we sort of realized that they come back with a patch much faster
than you can imagine.
Gardner: I’m afraid we’ll have to leave it there. You’ve been listening to a sponsored
BriefingsDirect discussion on how the speed and complexity of microservices-laden applications
can often leave developers in the dark as to what’s going on with their underlying dynamic
service architectures.
And we’ve learned how a sophisticated, distributed tracing and observability platform called
Hypertrace discovers, monitors, visualizes, and optimizes services for an innovative online
payments business, Razorpay.
So, a big thank you to our guests, Venkat Vaidhyanathan, Architect at Razorpay in Bangalore,
India. Thank you so much, Venkat.
Venkat: Thank you, Dana, for the opportunity, and thank you, Jayesh, and the Hypertrace team
for helping us to build and make our systems far more robust.
Gardner: We’ve also been here with Jayesh Ahire, Founding Engineer at Traceable AI and
Product Manager for Hypertrace. Thank you, Jayesh.
Ahire: Thanks, Dana. It was great talking to you and sharing our story.
Gardner: And a big thank you as well to our audience for joining this BriefingsDirect API
resiliency discussion. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host
throughout this series of Traceable AI-sponsored BriefingsDirect interviews.
Thanks again for listening. Please pass this along to your business community and do come
back for our next chapter.
Copyright Interarbor Solutions, LLC, 2005-2021. All rights reserved.
You may also be interested in:
● Introducing Hypertrace Java Agent | Hypertrace
● Getting started with Hypertrace on AWS EKS | Hypertrace
● Yet another Go Agent | Hypertrace
● Introducing Hypertrace | Hypertrace
● How to migrate your organization to a more security-minded culture
● How API security provides a killer use case for ML and AI
● Securing APIs demands tracing and machine learning that analyze behaviors to head off attacks
● Rise of APIs brings new security threat vector -- and need for novel defenses
● Learn More About the Technologies and Solutions Behind Traceable AI.
● Three Threat Vectors Addressed by Zero Trust App Sec

● Web Application Security is Not API Security
● Does SAST Deliver? The Challenges of Code Scanning.
● Everything You Need to Know About Authentication and Authorization in Web APIs
● Top 5 Ways to Protect Against Data Exposure
● TraceAI : Machine Learning Driven Application and API Security
