202104 technical challenging and our solutions - golang taipei

Technical Challenges in
MMAU microservices
Ronald (hothero)

About me
● Tech Lead (SSEII) - Carousell@TW
● Senior SE - Uber@AMS
● SSE - Carousell@TW/SG
● CTO & Co-Founder @ Backer-Founder
● A girl’s father
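● "The backend architecture I saw in one year at Carousell: challenges and life" (article in Chinese, 11.5K claps)
● "2017 backend engineer interviews and preparation experience" (article in Chinese, 1.7K claps)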

Snap, list, sell
List any item for sale in 30 seconds
In-app chat
Chat directly with sellers without
revealing personal information
Photo-centric
Core focus of the app on photos
Social
Share listings on social media channels
and join groups of people with similar
interests

In 8 years, we have grown to become
South East Asia’s #1 classifieds marketplace
Timeline figure, 2012 to 2020 (events in the order they appear on the slide):
● Founded Carousell in SG
● Raised USD800K Series Seed
● Raised USD6M Series A
● Launched in MY, ID, TW
● Launched web platform
● Launched HK + PH
● Raised USD35M Series B
● Acquired Caarly, launched Autos + Coins & Bumps + Property
● Launched Spotlight + Carousell Protection
● Raised USD85M Series C
● Acquired OLX PH, Naspers invests, received funding from Telenor, merged with 701 Search entities, Mudah, Cho Tot and OneKyat
● 2020: Carousell announces US$80M investment from Naver, Mirae Asset-Naver Asia Growth Fund and NH Investment & Securities
Phases: Phase I Foundation, Phase II Internationalization, Phase III Verticalization & monetization

8 markets
250 million user listings
$850 million USD in valuation

750+ teammates
19 nationalities
8 offices

Schematic diagram of architecture in 2021
Diagram labels: gateway; listing, recom, search, order, offer, ads, Payment, Logistics, homescreen; third-parties; 90+ services

Sharing our improvement first
● At the beginning of 2020, one of our systems (Payment & shipping, 7+ services tied to each other) had an error rate of around 1.8% -> Bad UX
● Users saw `Something went wrong` if any one call in the chain failed.

With more and more services, what are the challenges?
● If each of 5 chained services is 99% available, the overall SLA would be about 95% (0.99^5 ≈ 0.951), meaning roughly 2160 minutes of downtime / month
● An overall 99% needs about three nines (99.9%) on each service
● The reliability of every request made is really important
● Reasons for failure
○ Network issues
○ Capacity (DB/CPU/Memory/Concurrency/etc.)
○ Performance/Latency
○ Service connection sustainability
■ Infra-level (e.g. HAProxy, Envoy)
■ Application-level
○ Wrong logic/Bad code
Diagram labels: gateway, service1, service2, service3, service4, each at .99
.99 * .99 * .99 * .99 * .99

Application-level - the request made
How Golang projects handle timeout (diagram): Hystrix / CallerContext -> Load Balancer -> Server / CalleeContext
Hystrix caps at 2 seconds, but New Relic shows the avg latency can be up to 5 seconds (capped at infra level)

Common mistakes - timeout handling
How Golang projects handle timeout (diagram): Hystrix / CallerContext -> Load Balancer -> Server / CalleeContext
● Passing context.Background() into goroutine jobs, so they ignore the caller's deadline and block
● Not setting hystrix to cap long requests
● grpc.WithTimeout(timeout) only sets the timeout for dialing, not a general timeout for requests
https://play.golang.org/p/Nrhu5nhvaa3

POC of standardizing gRPC dialing

POC of a standardized wrapper for timeout handling
Topics
● Alternatives for retry
● Configuration control

Observability + Workflow: maintaining the level of reliability
Diagram labels: gateway, service1, service2, each with Hystrix
● Distributed tracing
● Error reporting
● Main alerting (Programmability)
● Monitoring (alert channels)
● Workflow - Support Rotation/Weekly Sync
Case study about the importance of context passing and tracing
● High error rate alert on one gateway endpoint
● Same error spike on service 1 but not on service 4
● No proper egress metrics (e.g. hystrix) on service 1 and 2, but the error rate was visible on the gRPC server
● Based on good-enough evidence, we checked with the corresponding team and confirmed it was related to the outage.
● If we had had tracing and appropriate context passing, we would have noticed the root cause on Jaeger very quickly.
● Demo
Diagram labels: gateway, service1, service2, service3, service4, service5

The type of business and product of Carousell
determines the focus
● Core and synchronous flows in Carousell
○ Homefeed
○ Search
○ Make an offer
○ Chat
○ Make an order
○ ...
● Many other configurations/components could still impact overall reliability, so why focus mostly on wrappers and communication?
○ Read volume is larger than write traffic
○ Stability is more important than performance for now
○ Browsing is the majority of user behaviors

Introducing service-mesh with multi-tenancy
● Technical challenges as we have more and more microservices and teams
○ Staging environment is very unstable
○ Productivity of all types of testing needs to improve
○ The need for built-in routing is increasing
Reference: https://eng.uber.com/multitenancy-microservice-architecture/

Metadata of gRPC in Golang
● HTTP Headers/gRPC Metadata -> incoming context -> outgoing context -> ...
https://github.com/grpc/grpc-go/blob/master/Documentation/grpc-metadata.md,
https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md
https://github.com/carousell/Orion/blob/master/interceptors/interceptors.go#L214-L229

Conclusion
● Context is the critical component in Golang projects
○ Metadata propagation
○ Distributed tracing
○ Service Mesh with tenancy
○ Timeout handling
○ …
● From the Go documentation (package context)
○ Incoming requests to a server should create a Context, and outgoing calls to servers should
accept a Context. The chain of function calls between them must propagate the
Context, optionally replacing it with a derived Context created using WithCancel,
WithDeadline ... Do not store Contexts inside a struct type; instead, pass a Context
explicitly to each function that needs it. The Context should be the first parameter,
typically named ctx.
● Service-mesh plays a role in micro-service architecture.
● We’re hiring, join us if you’re also interested in the journey :)

Current Positions
Transactional GC Team
Sr. Software Engineer, Backend
Buyer Experience Team
Sr. Software Engineer, Frontend
