Slides from my presentation on Micro-Services at Gilt at JavaOne 2015, San Francisco. I've given this preso a number of times, and this iteration includes some notes on how we classify and track our services, how we've inferred an emergent architecture, tracked ownership, and identified complexity.
12. driving forces behind gilt’s emergent architecture
● team autonomy
● voluntary adoption (tools, techniques, processes)
● kpi or goal-driven initiatives
● failing fast and openly
● open and honest, even when it’s difficult
18. … we used a “spreadsheet”.
‘The Gilt Genome Project’
19. It’s hard to think of architecture in one dimension.
We added ‘Functional Area’, ‘System’ and ‘Subsystem’ columns to Gilt Genome; this provides a stronger (although subjective) taxonomy than the previous ‘tags’ (sketched below).
It turns out we have an elegant, emergent architecture.
Some services / components are deceptively simple. Others are simply deceptive, and require knowledge of their surrounding ‘constellation’.
n = 265, where n is the number of services.
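For a feel of the taxonomy, here is a minimal sketch of how a Genome-style registry could be modelled; ServiceEntry, Genome and the field names are hypothetical, not the actual spreadsheet schema:

```scala
// Hypothetical model of a Gilt-Genome-style service registry.
// Names and fields are illustrative, not the real spreadsheet schema.
case class ServiceEntry(
  name: String,
  functionalArea: String, // e.g. "Back Office"
  system: String,         // e.g. "Order Processing"
  subsystem: String,
  owningTeam: String
)

object Genome {
  // Grouping the n = 265 services by system surfaces the
  // 'constellations' referred to below.
  def constellations(entries: Seq[ServiceEntry]): Map[String, Seq[ServiceEntry]] =
    entries.groupBy(_.system)
}
```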
22. Gilt Logical Architecture - Back Office Systems
● Gilt Admin (Legacy Ruby on Rails Application): City, Discounts, Financial Reporting, Fraud Mgmt, Gift Cards, Inventory Mgmt, Order Mgmt, Sales Mgmt, Product Catalog, Purchase Orders, Targeting, Billing
● Other Admin Applications (Scala + Play Framework)*: City, Creative (2), CS, Discounts, Distribution, i18n, Inventory (2), Order Processing (2), Util
● Service Constellations (Scala, Java)*: Auth (1), Billing (1), City (6), Creative (4), CS (2), Discounts (1), Distribution (9), i18n (3), Inventory (6), Order Processing (8), Payments (3), Product Catalog (5), Referrals (1), Util (2)
● Core Database - ‘db3’
● Job System (Java, Ruby)
* counts denote number of service / app components.
Simply deceptive: these services only make sense in the context of their constellation.
25. Lift-and-shift + elastic teams
● Existing data centre linked to the ‘Legacy VPC’ by a dual 10Gb direct connect line, 2ms latency.
● ‘Legacy VPC’ hosts Mobile, Common, Personalisation and Admin Data.
● (1) Deploy to VPC; (2) ‘Department’ accounts for elasticity & devops.
32. Lessen dependencies between teams: faster code-to-prod
● Lots of initiatives in parallel
● Your favourite <tech/language/framework> here
● We ♥ μ-services
● Graceful degradation of service
● Disposable code: easy to innovate, easy to fail and move on.
33. We ♥ cloud
● Do devops in a meaningful way
● Low barrier to entry for new tech (DynamoDB, Kinesis, ...)
● Isolation
● Cost visibility
● Security tools (IAM)
● Well documented
● Resilience is easy
● Hybrid is easy
● Performance is great
35. 1. staging vs test-in-prod
We find it hard to maintain staging environments across multiple teams with lots of services.
● We think TiP is the way to go: invest in automation, use dark canaries in prod (see the sketch after this list).
● However, some teams have found TiP counter-productive, and use minimal staging environments.
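A minimal sketch of the dark-canary idea referenced above: mirror each production request to a canary instance and discard its response, so real users are never affected. DarkCanaryRouter and its wiring are hypothetical, not Gilt's actual tooling:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Dark canary via request mirroring: the live instance serves the user;
// a copy of each request goes to the canary and its response is dropped
// (in practice you would log or diff it). Hypothetical names throughout.
class DarkCanaryRouter[Req, Resp](
    live: Req => Future[Resp],
    canary: Req => Future[Resp]
)(implicit ec: ExecutionContext) {

  def handle(request: Req): Future[Resp] = {
    // Fire-and-forget: a canary failure must never affect the user.
    canary(request).failed.foreach(e => println(s"canary error: ${e.getMessage}"))
    live(request)
  }
}
```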
36. 2. ownership
Who ‘owns’ that service? What happens if that person decides to work on something else?
We have chosen for teams and departments to own and maintain their services. No throwing this stuff over the fence.
37. bottom-up ownership, RACI-style
1. Software is owned by departments, tracked in the ‘genome project’. Directors assign services to teams.
2. Teams are responsible for building & running their services; directors are accountable for their overall estate.
39. 3. deployment
Services need somewhere to live. We’ve open-sourced tooling over Docker and AWS to give: elasticity + fast provisioning + service isolation + fast rollback + repeatable, immutable deployment.
https://github.com/gilt/ionroller
40. 4. lightweight APIs
We’ve settled on REST-style APIs, using http://apidoc.me. Separate interface from implementation; ‘an Avro for REST’ (Mike Bryzek, Gilt Founder).
We strongly recommend zero-dependency, strongly-typed clients (see the sketch below).
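To illustrate ‘zero-dependency’: a client that leans only on the JDK and the Scala standard library, with typed models at the surface. OrdersClient and Order are hypothetical examples, not output of the apidoc.me generator:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// A strongly-typed surface (the Order model) with no third-party
// dependencies underneath. Hypothetical example, not generated code.
final case class Order(id: String, total: BigDecimal)

class OrdersClient(baseUrl: String) {
  // Fetch the raw JSON for an order; a generated client would parse
  // this into the typed Order model (parsing elided here).
  def getOrderJson(id: String): String = {
    val conn = new URL(s"$baseUrl/orders/$id").openConnection()
      .asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }
}
```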
41. 5. audit + alerting
How do we stay compliant while giving engineers full autonomy in prod?
Really smart alerting: http://cavellc.github.io
orders[shipTo: US].count.5m == 0
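Reading that expression: count orders tagged shipTo: US over a trailing five-minute window and fire when the count is zero. A hand-rolled sketch of those semantics (hypothetical types, not how Cave itself evaluates expressions):

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object AlertCheck {
  case class OrderEvent(shipTo: String, at: Instant)

  // True when no US-bound orders were seen in the last five minutes,
  // i.e. the condition orders[shipTo: US].count.5m == 0.
  def usOrdersStalled(events: Seq[OrderEvent], now: Instant): Boolean = {
    val windowStart = now.minus(5, ChronoUnit.MINUTES)
    events.count(e => e.shipTo == "US" && e.at.isAfter(windowStart)) == 0
  }
}
```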
42. 6. io explosion
Each service call begets more service calls, some of which are redundant...
=> unintended complexity and degraded performance
Looking to the lambda architecture for critical-path APIs: precompute, real-time updates, O(1) lookup (sketched below).
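A minimal sketch of that read path, assuming a periodically rebuilt batch view plus a real-time delta layer merged at lookup time; PrecomputedView is an illustrative name, not a Gilt component:

```scala
import scala.collection.concurrent.TrieMap

// Lambda-architecture read path: the batch layer precomputes the view,
// the speed layer absorbs fresh updates, and reads merge the two in O(1).
class PrecomputedView[K, V](batch: Map[K, V]) {
  private val realtime = TrieMap.empty[K, V]

  // Speed layer: apply an incoming update immediately.
  def update(key: K, value: V): Unit = realtime.update(key, value)

  // Serving layer: real-time values shadow the batch view.
  def lookup(key: K): Option[V] = realtime.get(key).orElse(batch.get(key))
}
```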
43. 7. reporting
Many services => many databases => data is decentralized.
Solution: real-time event queues feeding a data lake (see the sketch below).
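A sketch of the producing side of that pipeline: services publish domain events to a queue (e.g. Kinesis, mentioned earlier) and a consumer lands them in the data lake, instead of reporting jobs querying each database directly. The EventQueue trait and the event shape are hypothetical:

```scala
import java.time.Instant

object Reporting {
  // An illustrative event envelope; real schemas would be versioned.
  case class DomainEvent(service: String, kind: String, payload: String,
                         occurredAt: Instant = Instant.now())

  // Abstracts the queue (Kinesis, Kafka, ...); hypothetical interface.
  trait EventQueue {
    def publish(event: DomainEvent): Unit
  }

  // A service emits an event at the moment the fact becomes true.
  def recordOrderShipped(queue: EventQueue, orderId: String): Unit =
    queue.publish(DomainEvent("order-service", "order_shipped",
      s"""{"orderId":"$orderId"}"""))
}
```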