LinkedIn runs the world's largest professional community, with 500M members worldwide, 10M active jobs, 9M companies, and 100k articles published weekly. A stable and scalable Jira deployment was essential to reaching these milestones. Accommodating 2.5 million hits a day on a Jira instance with 3 million issues demands focus, governance, and innovation. The Jira team at LinkedIn operationalized a single Jira deployment to support a multitude of use cases across the company. Join Arnie and Dan to hear how they grew Jira to meet the aggressive demands of a hypergrowth company.
Focus, Governance, and Innovation: How LinkedIn Scaled to 3M Jira Issues and 500M Members
1. LinkedIn scales to 3M issues
and 500M members
A tale of process, people, and technology
ARNIE MATZ | DIRECTOR, SOFTWARE ENGINEERING | LINKEDIN
DAN HATA | SENIOR ENGINEERING MANAGER | LINKEDIN
3. 500,000,000+ registered members
United States of America: 138+M
Brazil: 29+M
India: 42+M
Indonesia: 8+M
Philippines: 4+M
Malaysia: 3+M
Singapore: 1+M
Japan: 1+M
Korea: 1+M
China: 32+M
New Zealand: 1+M
Australia: 8+M
UK: 23+M
France: 14+M
Germany: 10+M
Italy: 10+M
Spain: 9+M
Netherlands: 7+M
Belgium: 3+M
Denmark: 2+M
Sweden: 2+M
Ireland: 1+M
13. What did LinkedIn think of Jira in 2015?
Why does my
dashboard freeze at
10am?
Jira was fast at my
last company.
Why don’t we just
build our own Jira?
Why does Jira always
crash?
Have we looked into
alternatives?
Why is Jira always
slow?
Will Jira ever be stable?
16. 2015 Scary Stuff
Stability and Performance
No understanding of Jira stability and
performance issues.
No change control process.
Unlimited admins
Many people have admin access,
making change control and
standardization impossible.
Lucene index corruption
25% of Jira restarts resulted in index
corruption and recovery takes hours.
Rapid custom field growth
Contributes to index growth. There
was no governance. Admins said yes
to all custom field requests.
17. No governance
No change control
No metrics
2015 ASSESSMENT SUMMARY
Unplanned outages almost every day
Sometimes all day
Thousands of custom fields
Growing by 150% year over year
....and way out of control
19. CHECK POINT
1.2 million issues
300+ million members
6,000+ employees
2015
People: CRITICAL
Process: REVIEW
Technology: CRITICAL
20. Roles For Supporting Jira
App Admins
Focus on customer
service: external
consultants
Operations
Focus on deterministic
change and mitigating
risk
Developers
Ensuring performance
and scale are built into
all solutions
Managers
Focus on governance,
strategy, Atlassian
partnership.
30. Misuse: What can I do?
• Document and communicate what is
acceptable use
• Work with users to find the right solution
• Through technology, make it impossible
for misuse to reoccur
2015 Process Improvements
31. Change Control
• Configuration as code
• All changes are tested,
reviewed, communicated,
with rollback plans
Service Level Objectives
• Tracked and investigated violations
• Example: <2 seconds issue creation time
2015 Process Improvements
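The idea of tracking and investigating SLO violations can be sketched in a few lines. This is a minimal illustration only, assuming the slide's example objective of issue creation in under 2 seconds; the `Sample` type, the function names, and the sample data are hypothetical and do not come from LinkedIn's tooling.

```python
from dataclasses import dataclass

# Assumed threshold, taken from the slide's example: issue creation under 2 seconds.
ISSUE_CREATE_SLO_SECONDS = 2.0


@dataclass
class Sample:
    request_id: str          # hypothetical identifier for one measured request
    duration_seconds: float  # measured issue-creation time


def find_violations(samples, threshold=ISSUE_CREATE_SLO_SECONDS):
    """Return the samples whose duration meets or exceeds the threshold."""
    return [s for s in samples if s.duration_seconds >= threshold]


def compliance_ratio(samples, threshold=ISSUE_CREATE_SLO_SECONDS):
    """Fraction of samples that met the objective (1.0 when there are no samples)."""
    if not samples:
        return 1.0
    return 1 - len(find_violations(samples, threshold)) / len(samples)


# Made-up measurements, for illustration only.
samples = [Sample("a", 0.8), Sample("b", 2.4), Sample("c", 1.9), Sample("d", 3.1)]
violations = find_violations(samples)
print([s.request_id for s in violations])  # the requests to investigate
print(compliance_ratio(samples))
```

Each entry in `violations` is a candidate for investigation; the ratio gives a single number to track over time.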
32. Atlassian Relationship
• Introduced TAM
• Added Premier Support
• Partnered with TAM and PS at Atlassian
to target a performance upgrade
• Extended licensing for end-of-life plugin
2015 Process Improvements
33. Monitoring and SLOs
Atlassian relationship
Governance
SUCCESS
WARNING
Change control ALMOST
2015 Process: After
ALMOST
44. CHECK POINT
2 million issues
400+ million members
9,000+ employees
2016
1.2 million issues
300+ million members
6,000+ employees
2015
People: CRITICAL
Process: REVIEW
Technology: CRITICAL
People: ALMOST
Process: REVIEW
Technology: REVIEW
48. 2016 Process Improvements
Governance
• Documented and communicated
• All requests lead with business requirement
• Scale is the most important requirement
• Automated Governance
49. 2016 Process Improvements
Operational Excellence Culture
• Code and config reviews
• Intelligent risk decision
• Change control and communication
• Monitoring and metrics
• Automated remediation
• Service level objectives
• Awesome alerts and response
• Business continuity plan
• Relentless pursuit of exception causation
• Blameless postmortems
50. 2016 Process Improvements
Partnering with Atlassian
• TAM relationship: evolved from tactical to
strategic in 2016
• Partnering with TAM for all major upgrades
• Atlassian Premier Support provided a critical bug
fix over the holidays to address a bug in a widely
used gadget
53. User blacklisting and throttling
• Implemented blacklisting based on username
• Throttling based on requests/minute per host
2016 Technology Improvements
# Jira.conf
# Blacklist a user by adding an entry with a value of 1.
map $remote_user $user_blacklisted {
    default 0;
    "johnnynumberfive" 1;
}
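The two controls on this slide, a username blacklist and a requests-per-minute limit per host, can be sketched outside the web server to show the logic. The deck shows only the nginx `map` above; the `RequestGate` class below is a hypothetical Python illustration of the same idea using a sliding 60-second window, not the configuration LinkedIn ran.

```python
import time
from collections import defaultdict, deque


class RequestGate:
    """Hypothetical gate: rejects blacklisted users and throttles per host."""

    def __init__(self, max_per_minute, blacklisted_users=(), clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.blacklisted_users = set(blacklisted_users)
        self.clock = clock                   # injectable so the window can be tested
        self._hits = defaultdict(deque)      # host -> timestamps of accepted requests

    def allow(self, username, host):
        """Return True if the request may proceed, False if it is rejected."""
        if username in self.blacklisted_users:
            return False
        now = self.clock()
        hits = self._hits[host]
        # Drop timestamps that have aged out of the 60-second window.
        while hits and now - hits[0] >= 60:
            hits.popleft()
        if len(hits) >= self.max_per_minute:
            return False
        hits.append(now)
        return True


# Illustration with a fake clock so the result does not depend on wall time.
fake_now = [0.0]
gate = RequestGate(
    max_per_minute=2,
    blacklisted_users={"johnnynumberfive"},
    clock=lambda: fake_now[0],
)
results = [gate.allow("alice", "10.0.0.1") for _ in range(3)]
print(results)  # [True, True, False]
```

Keying the throttle on host and the blacklist on username mirrors the split described in the bullets; only accepted requests count against the window in this sketch.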
58. Logstash
Parsing logs to make useful
data
Adding in ELK
Kibana
Create dashboards showing
insightful data
Elasticsearch
Horizontally scalable data
storage
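"Parsing logs to make useful data" means turning each raw log line into named fields that can be indexed and charted. The sketch below shows that step in Python rather than in Logstash itself. The log line layout, the regular expression, and the field names are all assumptions made for illustration; the deck does not show the actual log format or pipeline configuration.

```python
import re
from collections import defaultdict

# Assumed line layout (hypothetical):
# host user [timestamp] "METHOD path protocol" status bytes seconds
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) (?P<seconds>\d+(?:\.\d+)?)'
)


def parse_line(line):
    """Turn one log line into a dict of fields, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["seconds"] = float(event["seconds"])
    return event


def requests_per_user(lines):
    """Count parsed requests per user; unparseable lines are skipped."""
    counts = defaultdict(int)
    for line in lines:
        event = parse_line(line)
        if event is not None:
            counts[event["user"]] += 1
    return dict(counts)


# Made-up log lines, for illustration only.
lines = [
    '10.0.0.5 alice [12/Sep/2017:10:00:01 +0000] "GET /browse/ABC-1 HTTP/1.1" 200 1532 0.412',
    '10.0.0.6 bob [12/Sep/2017:10:00:02 +0000] "POST /rest/api/2/issue HTTP/1.1" 201 310 1.875',
    '10.0.0.5 alice [12/Sep/2017:10:00:03 +0000] "GET /secure/Dashboard.jspa HTTP/1.1" 200 90211 4.020',
    'this line is not in the expected format',
]
print(requests_per_user(lines))
```

Once lines are structured like this, a per-user or per-path aggregate is a one-liner, which is the kind of data a dashboard can then display.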
63. CHECK POINT
2 million issues
400+ million members
9,000+ employees
2016
3 million issues
500+ million members
10,000+ employees
1.2 million issues
300+ million members
6,000+ employees
People: CRITICAL
Process: REVIEW
Technology: CRITICAL
2015
ALMOST
SUCCESS
People:
Process:
Technology: ALMOST
2017
ALMOST
People:
Process:
Technology: ALMOST REVIEW
REVIEW
70. Real User Monitoring
• Performance regression reports emailed daily
• Response times include rendering
• Global statistics give us insight into latency
2017 Technology Improvements
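A daily performance regression report boils down to comparing today's response times with a baseline and flagging pages that got noticeably slower. The following is a minimal sketch under stated assumptions: the 95th percentile, the 20% tolerance, the page names, and the timings are all hypothetical choices for illustration, and the deck does not describe how LinkedIn's reports were computed.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def regression_report(baseline, today, pct=95, tolerance=0.20):
    """Return {page: (baseline_value, today_value)} for pages that regressed.

    A page regresses when today's percentile exceeds the baseline percentile
    by more than the given tolerance (0.20 means 20% slower).
    """
    report = {}
    for page, today_samples in today.items():
        if page not in baseline:
            continue  # no baseline to compare against
        before = percentile(baseline[page], pct)
        after = percentile(today_samples, pct)
        if after > before * (1 + tolerance):
            report[page] = (before, after)
    return report


# Made-up response times in seconds, including rendering, for illustration only.
baseline = {"dashboard": [1.0, 1.1, 1.2, 1.3], "issue_view": [0.5, 0.6, 0.7, 0.8]}
today = {"dashboard": [1.0, 1.2, 1.9, 2.0], "issue_view": [0.5, 0.6, 0.7, 0.8]}
print(regression_report(baseline, today))  # {'dashboard': (1.3, 2.0)}
```

Using a high percentile rather than the mean keeps the report sensitive to the slow tail that users actually notice.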
73. Jira Data Center
• Four nodes improve our MTTR by avoiding lengthy
index rebuilds
• Resilient to the "single click of death"
2017 Technology Improvements
74. Getting to Scale
Always ask why!
Invest in the team
Build vendor relationship
Lather, rinse, repeat
76. Thank you!
ARNIE MATZ | DIRECTOR, SOFTWARE ENGINEERING | LINKEDIN
DAN HATA | SENIOR ENGINEERING MANAGER | LINKEDIN