scaling μ-services at
Gilt
ade@gilt.com
Sopot, Poland
11th September 2015
Adrian Trenaman, SVP Engineering, Gilt,
@adrian_trenaman
@gilttech
why was I late today?
and…
were micro-services to blame?
[diagram: svc-localised-string, backed by mongodb, consumed by the login-reg mosaic, product listing, and product search services]
● A localisation file was loaded with an unexpected character encoding.
● The driver spun on CPU, consuming CPU credits.
● The service starved and fell over.
● Core parts of the site were broken.
so…
… how did I really feel
about micro-services yesterday?
gilt: luxury designer brands at discounted prices
we shoot the product in our studios
we receive, store, pick, pack and ship...
we sell every day at noon
stampede...
this is what the stampede really looks like...
rails to riches: 2007 - ruby-on-rails monolith
2011: java, loosely-typed, monolithic services
● Hidden linkages; buried business logic
● Monolithic Java app: a huge bottleneck for innovation
● Lots of duplicated code :(
teams focused on business lines
large, loosely-typed JSON/HTTP services
enter: µ-services
“How can we arrange our teams around strategic initiatives? How can we make it fast and easy to get change to production?”
2015: micro-services
driving forces behind gilt’s emergent
architecture
● team autonomy
● voluntary adoption (tools, techniques,
processes)
● kpi or goal-driven initiatives
● failing fast and openly
● open and honest, even when it’s difficult
service growth over time: point of inflexion === scala.
what are all these services doing?
anatomy of a gilt service
anatomy of a gilt service - typical choices
gilt-service-framework, log4j, cloudwatch, Cave, java, javascript
lines of code per service
# source files per service
service discovery: straightforward
zookeeper + Brocade Traffic Manager (aka Zeus, Stingray, SteelApp, ...)
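For context, a minimal sketch of the general pattern (not Gilt's actual framework code): each service instance registers itself as an ephemeral ZooKeeper znode on startup, and consumers list the children of the service's path to find live instances. The connection string, path layout, and service names below are assumptions, and the sketch assumes the parent paths already exist.

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}
import scala.jdk.CollectionConverters._

object ZkRegistration {
  // Connect to the ZooKeeper ensemble (address is a placeholder).
  val zk = new ZooKeeper("zk1.example.com:2181", 5000, new Watcher {
    def process(event: WatchedEvent): Unit = () // ignore session events in this sketch
  })

  // Register this instance as an ephemeral node: it disappears automatically
  // when the service dies and its session expires.
  def register(serviceName: String, hostPort: String): Unit = {
    val path = s"/services/$serviceName/$hostPort"
    zk.create(path, Array.emptyByteArray, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
  }

  // Consumers discover live instances by listing the service's children.
  def discover(serviceName: String): List[String] =
    zk.getChildren(s"/services/$serviceName", false).asScala.toList
}
```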
from bare-metal (PHX, IAD)... to vapour.
single tenant deployment: one AMI per service instance
reproducible, immutable deployments: docker
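As a hedged illustration of one way to get a reproducible, immutable image for a JVM service, here is a generic sbt-native-packager Docker setup; this is not Gilt's actual build, and the service name, version, base image, and port are assumptions.

```scala
// build.sbt (requires sbt-native-packager on the plugin classpath)
enablePlugins(JavaAppPackaging, DockerPlugin)

name    := "svc-example"   // hypothetical service name
version := "1.0.42"        // every build produces a new, immutable tag

dockerBaseImage    := "openjdk:8-jre"
dockerExposedPorts := Seq(9000)

// `sbt docker:publishLocal` then builds an image tagged svc-example:1.0.42;
// deploys roll forward or back by switching tags rather than mutating hosts.
```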
service discovery: new services use ELB
zookeeper → Amazon ELB
# running AMIs per service
lift’n’shift + elastic teams
Existing data centre linked to AWS by a dual 10Gb direct connect line, 2ms latency
AWS instance sizing
evolution of architecture and tech organisation
We (heart) μ-services
● Lessen dependencies between teams: faster code-to-prod
● Lots of initiatives in parallel
● Your favourite <tech/language/framework> here
● Graceful degradation of service
● Disposable code: easy to innovate, easy to fail and move on.

We (heart) cloud
● Do devops in a meaningful way.
● Low barrier of entry for new tech (dynamoDB, Kinesis, ...)
● Isolation
● Cost visibility
● Security tools (IAM)
● Well documented
● Resilience is easy
● Hybrid is easy
● Performance is great
seven μ-service
challenges
(& some solutions)
no one ever said this was gonna be easy
1. staging vs test-in-prod
We find it hard to maintain staging environments across multiple teams with lots of services.
● We think TiP is the way to go: invest in automation, use dark canaries in prod.
● However, some teams have found TiP counter-productive, and use minimal staging environments.
2. ownership
Who ‘owns’ that service? What happens if that person decides to work on something else?
We have chosen for teams and departments to own and maintain their services. No throwing this stuff over the fence.
bottom-up ownership, RACI-style
1. Software is owned by departments, tracked in the ‘genome project’. Directors assign services to teams.
2. Teams are responsible for building & running their services; directors are accountable for their overall estate.
3. Ownership is classified: active, passive, at-risk.
the ‘ownership donut’ informs tech strategy
‘done’ === 0% ‘at risk’
3. deployment
Services need somewhere to live. We’ve open-sourced tooling over docker and AWS to give:
elasticity + fast provisioning + service isolation + fast rollback + repeatable, immutable deployment.
https://github.com/gilt/ionroller
4. lightweight APIs
We’ve settled on REST-style APIs, using http://apidoc.me. Separate interface from implementation; “an AVRO for REST” (Mike Bryzek, Gilt Founder).
We strongly recommend zero-dependency, strongly-typed clients.
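To illustrate what ‘zero-dependency, strongly-typed’ can mean in practice, here is a hedged sketch of the shape such a client might take: a case class per resource plus plain JDK HTTP calls, so consumers pull in no extra libraries. The resource, field names, and paths are made up for illustration; apidoc generates the real clients from the API’s JSON definition.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Strongly-typed resource; in a generated client this mirrors the apidoc model.
case class Product(id: String, name: String, priceCents: Long)

class ProductClient(baseUrl: String) {
  // Plain JDK HTTP: no client-side library dependencies to clash with the consumer's stack.
  def getRaw(id: String): String = {
    val conn = new URL(s"$baseUrl/products/$id").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    try Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
    finally conn.disconnect()
  }
  // JSON decoding elided: a generated client would parse the body into Product here.
}
```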
5. audit + alerting
How do we stay compliant while giving
engineers full autonomy in prod?
Really smart alerting: http://cavellc.github.io
orders[shipTo: US].count.5m == 0
6. io explosion
Each service call begets more service calls, some of which are redundant...
=> unintended complexity and performance cost
Looking to the lambda architecture for critical-path APIs: precompute, real-time updates, O(1) lookup.
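A minimal sketch of the idea (not Gilt's implementation): keep a precomputed view in an O(1) lookup structure and apply real-time updates from an event stream, so the critical-path API never fans out into a chain of downstream service calls. The event type and field names are hypothetical.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical event type for illustration.
case class PriceChanged(productId: String, priceCents: Long)

object PriceView {
  // Precomputed view: seeded from a batch job, then kept fresh by events.
  private val prices = TrieMap.empty[String, Long]

  def seed(batch: Map[String, Long]): Unit = prices ++= batch

  // Real-time update path: apply each event as it arrives.
  def onEvent(e: PriceChanged): Unit = prices.update(e.productId, e.priceCents)

  // Critical-path lookup is O(1) and makes no downstream service calls.
  def priceOf(productId: String): Option[Long] = prices.get(productId)
}
```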
7. reporting
Many services => many databases => data is decentralised.
Solution: real-time event queues to a data-lake.
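As a rough sketch of the ‘real-time event queue’ side (the stream name and payload are assumptions, and this uses the AWS Java SDK rather than Gilt's internal tooling): each service publishes its domain events to a stream, and a downstream consumer lands them in the data-lake for reporting.

```scala
import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object OrderEvents {
  private val kinesis = AmazonKinesisClientBuilder.defaultClient()

  // Publish one domain event; a separate consumer drains the stream into the data-lake.
  def publish(orderId: String, json: String): Unit = {
    val request = new PutRecordRequest()
      .withStreamName("order-events")            // hypothetical stream name
      .withPartitionKey(orderId)                 // keeps events for one order in order
      .withData(ByteBuffer.wrap(json.getBytes("UTF-8")))
    kinesis.putRecord(request)
  }
}
```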
so…
how did I really feel about yesterday’s outage?
great.
[diagram: svc-localised-string, backed by mongodb, consumed by the login-reg mosaic, product listing, and product search services]
● A localisation file was loaded with an unexpected character encoding.
● The driver spun on CPU, consuming CPU credits.
● The service was small: it was re-written in about an hour, deployed, and the site was fixed.
● We knew exactly where the problem was.
● We focussed and rapidly deployed tentative incremental fixes.
● Once we fixed that problem, all of our problems were fixed.
Try that in a monolith :)
scaling μ-services at
Gilt
ade@gilt.com
Sopot, Poland
11th September 2015
Adrian Trenaman, SVP Engineering, Gilt,
@adrian_trenaman
@gilttech
