Reflecting a Year After Migrating to Apache Traffic Server

LinkedIn started May 2003, I started August 2011. Over 8 years of cruft and confusion piled up before we even considered moving to Apache Traffic Server. This talk will focus on the journey and what we learned along the way:
* What LinkedIn is doing with ATS to effect change across the entire stack with an infrastructure tier
* Building automation and tooling
* Bizarre scenarios of how users query the site
* Metrics and monitoring
* Patches we contributed


  • https://iwww.corp.linkedin.com/wiki/cf/display/~niberry/Velocity+2013+Proposal
  • http://velocityconf.com/velocity2013/public/schedule/detail/28461
  • I like this one a lot
  • Really, this talk is about how introducing ATS into the LinkedIn stack completely changed how we tackled several complex problems
  • I started working at LinkedIn August 2011. I manage the SRE team responsible for Core, Security and Presentation Infrastructure. We support Identity infrastructure, Growth/Registration, Engagement and several other systems whose details I'll spare you, since you're here to learn about ATS... so what is it?
  • Bad ass HTTP proxy. Multi-threaded, non-blocking I/O, pluggable, well known for caching. Inktomi wrote it sometime in the mid-to-late '90s, and Yahoo open-sourced it in 2010.
  • So a few companies are using it… and these are just the people that bothered to put their logo on the Customers page
  • If you wanted a feature, it would be built as a Tomcat filter and deployed out to the majority of the site. For anything not running in Tomcat, there was no solution. Lots of frontends on lots of hosts.
  • LinkedIn started acquiring companies, and it was really difficult to integrate their stacks into our own. We're in a heterogeneous environment, so supporting features across multiple platforms is a requirement, and abstracting that from the frontend itself into an infrastructure tier completely changes the game. (give a story?)
  • Centralize the effort of these features. Acquisitions become first-ish class citizens. Need to make a change? Push the plugin out to the ATS tier instead of coordinating with all the service owners for weeks to update/deploy their code. Reduce the time to deploy.
  • (slow down the delivery on this one) HA Proxy and Varnish were not considered. Nginx vs. ATS: maturity, scalability, modularity, in-house knowledge. https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Comparison+of+TS+and+nginx. Dynamically loading plugins without having to recompile a new server binary gave us the flexibility we needed to enable/disable features quickly.
  • Before we even started, we had patches for ATS addressing an issue with keep-alive handling and adding support for remap_with_recv_port, to allow routing requests from different incoming ports to different origins. There was no good source of truth for how to route requests, so we had to audit configs and access logs to build the config. Metrics for non-Java services at LinkedIn really didn't exist, so we built a Python-based framework to fit seamlessly into our monitoring model. After all this, we could start migrating!
  • We were a little ambitious
  • Ok, very ambitious
  • L1 Proxy will be ATS with a few plugins
  • First, we migrated our SEO-optimized Profile pages to L1 Proxy. This allowed us to support routing unauthenticated users out of ECH3.
  • Certain requests can be served out of different data centers. So for public profile requests from our signed-out users, we can route them to our Chicago data center for improved RTT.
  • if there is a cookie named foo and it starts with bar, route it to the moon.
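The rule above can be sketched in Python. This mirrors the pseudocode on the "Request Rules Remap" slide later in the deck; the cookie name "foo", prefix "bar", and the two origin hosts are the slide's illustrative placeholders, not a production config:

```python
def route_request(cookies):
    """Pick an origin based on a request cookie, per the
    'Request Rules Remap' slide (illustrative values only)."""
    if cookies.get("foo", "").startswith("bar"):
        return "chicago.linkedin.com:8888"
    return "losangeles.linkedin.com:8888"
```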
  • Drop the request at L1 Proxy instead of wasting cycles on the frontend. This allowed us to automatically deny requests based on limits, without the need for people to scan access logs and manually block IP addresses. We were prepared with our new Sentinel plugin before Christmas, but decided to delay enabling it until after the holidays. Since scrapers do not seem to celebrate New Year's, we were forced to enable it on New Year's Day 2012, and it worked!
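The talk never details Sentinel's algorithm, so as a rough sketch of the idea only (automatically denying requests once a client exceeds a limit), here is a hypothetical fixed-window per-IP counter; the class name, limit, and window size are illustrative assumptions, not LinkedIn's implementation:

```python
import time
from collections import defaultdict

class RateLimiter:
    """Hypothetical fixed-window per-IP limiter illustrating the
    Sentinel idea: count requests per client IP and deny once a
    limit is exceeded within the window."""

    def __init__(self, limit, window_secs=60):
        self.limit = limit
        self.window = window_secs
        self.counts = defaultdict(int)
        self.window_start = time.time()

    def allow(self, client_ip):
        now = time.time()
        if now - self.window_start >= self.window:
            self.counts.clear()      # start a fresh counting window
            self.window_start = now
        self.counts[client_ip] += 1
        return self.counts[client_ip] <= self.limit
```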
  • Hopefully you had a chance to attend Veena’s talk yesterday on The Curious Case of Dust JavaScript and Performance. In case you didn’t make it, I’ll explain what it does
  • call out Veena's talk, you've missed it but you can see more information here... add a note about USSR + V8
  • At a high level, this allows your app to do specifically what it's supposed to. Before Fizzy, LinkedIn would ship code into multiple frontends so they could render the module within the various services. Now, People You May Know's module can be fetched and embedded into the Profile page, while an Ad can be pulled in from another frontend.
  • Up until this point, configs were manually edited and deployed. This sucked, big time. When we started, there was no great source of truth for the data we needed. By this point, LinkedIn SRE had a metadata store with most of the data in place, and we just needed to fill the gaps. This significantly improved managing configs and reduced the amount of human error. Teams across the company wanted to build ATS plugins. This was both good and bad: good that we were able to solve some difficult problems, bad that our proxy tier was becoming more complicated. The Mobile team wrote a plugin to detect when to issue redirects to Mobile pages, instead of handling it in the Tomcat frontends. Security started manipulating and enforcing cookies, addressing legacy issues that were traditionally difficult to track down. This also led to the development of QD Proxy...
  • The key design behind Quick Deploy is that if you develop a service locally, you can initiate a request to that service, either directly or indirectly, by using QD Proxy in LinkedIn’s staging environment. All other components of my request go to the Staging environment.
  • If I have a minor tweak to make and I'm not ready to commit, I can set up a QD Proxy profile to route the frontend request to my dev box, and my frontend will talk back to QD Proxy for all the backend calls, which will be sent to the backends in Staging. Really freakin' sweet.
  • I can also have a frontend in Staging send the backend requests to my dev box based on my QD Proxy profile. Testing features against a complete environment before committing is now possible.
  • not because of Fizzy, but from a compounding loop through a single pair of load balancers, causing the LB pair's CPU to spike and drop requests. This sucks! We were so close to finishing the migration and now we'll have to buy new load balancers to handle the load... or will we? That's when HA Proxy came into our lives: "The Reliable, High Performance TCP/HTTP Load Balancer". Since we already had all the data in Range, we could generate our HA Proxy configs in seconds and deploy them in minutes, allowing us to automate load balancing changes without having to touch network gear.
  • Moving to HA Proxy gave SRE: complete control over how we handled load balancing; fewer requests to NetOps, which in turn reduced our turnaround time; fewer network hops between ATS and the frontend; and removed single points of failure between L1 Proxy and Fizzy by eliminating the load balancer.
  • Month 10: we did it! www.linkedin.com migrated behind L1 Proxy and Fizzy.
  • Started out at 4,000 QPS and 120M members; today we're nearing 70,000 QPS at 225M members. That's approximately a 4x increase in QPS, year over year. NetScaler is still in play, but now only providing load balancing to L1 Proxy as well as SSL termination. At this point, we're able to consider the possibility of removing the NetScaler altogether. Bug fixes for features in ATS can be rolled out in hours, not days, and our acquisitions get all the goodness the rest of the site does.
  • Stability: when you're introducing a critical tier and forcing everyone onto it, customer service is key. We spent many hours debugging invalid (and some valid) escalations to help build confidence. Invalid requests: POST requests with no Content-Length and no body. Connection failed: clients using CONNECT for no reason!
  • Here are 15 out of 30 outages since we set up L1 Proxy and Fizzy. Each outage reminded us of the impact of even the smallest of changes. There will be mistakes, there will be unexpected surprises. If you're going to fail, do it quick and recover fast. Learn from the mistakes, and avoid repeating them.
  • We're doing more with ATS than ever before, and the outage rate is not affected by it. Issues with plugins are now caught earlier in the development process: they're performance-tested before going to staging, with a deployment schedule and strict guidelines to ensure testing/verification is done before promoting to production. And you can see a downward trend in the Human factor.
  • We strive to keep our graphs looking good, even when they're bad... so much so that my team will draw on post-it notes to cover up nasty outages. So how did we do this? With a few different tools.
  • I suggest summarizing some of the data, unless you’re prepared to consume all the metrics
  • Great for reading variables (core + plugins) to monitor.
  • We don't want to shell out to gather metrics; there's an HTTP endpoint! Awesome, right?
  • We take start_time and use it to calculate uptime by subtracting start_time from time.time()
  • Tracking start_time helps highlight crashes, deployments, and people doing things to the service that they shouldn't be
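The uptime calculation from the notes above is simple to sketch. The stat key comes from the {stat} slides later in the deck; the function assumes the endpoint's key/value output has already been parsed into a dict:

```python
import time

def uptime_from_stats(stats, now=None):
    """Uptime of traffic_server, derived by subtracting the
    proxy.node.restarts.proxy.start_time stat (epoch seconds)
    from the current time, as described in the notes."""
    start_time = float(stats["proxy.node.restarts.proxy.start_time"])
    return (now if now is not None else time.time()) - start_time
```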
  • monitor trends coming in and going out
  • We track how close we’re getting to the throttle limit.
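Tracking how close we are to the throttle boils down to comparing two stats from the {stat} endpoint (the names come from the slides); a minimal sketch, assuming the values have already been fetched:

```python
def throttle_headroom(current_open, throttle_limit):
    """Fraction of the connection throttle currently in use:
    proxy.process.net.connections_currently_open divided by
    proxy.config.net.connections_throttle. Alert as this nears 1.0."""
    return current_open / float(throttle_limit)
```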
  • that’s bad
  • Core dump rate: monitoring the file system for core dumps newer than 24 hours, alert if > N. TCP states: captured from netstat, watching for a spike in TIME_WAIT. Proc: memory usage, swap usage, file descriptor usage.
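The core-dump check can be sketched as a small helper; the directory argument and the "core" filename prefix are assumptions for illustration, not LinkedIn's actual monitoring code:

```python
import os
import time

def recent_core_dumps(core_dir, max_age_secs=24 * 3600):
    """List core files in core_dir modified within the last 24 hours;
    an alert would fire when the count exceeds some threshold N."""
    cutoff = time.time() - max_age_secs
    return sorted(
        name
        for name in os.listdir(core_dir)
        if name.startswith("core")
        and os.path.getmtime(os.path.join(core_dir, name)) >= cutoff
    )
```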
  • They don't give enough of a picture of the requests we're processing. If you need to debug a problem, you need a combination of these familiar logging formats... fortunately there's custom logging
  • Log request headers, response headers, timing, origin. We tail -f the log, aggregate, and report timing for given paths (something we don't get with traffic_logstats)
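The per-path timing aggregation described above might look like this; the regex assumes a simplified form of the custom log format shown later in the deck (a quoted request line plus a trailing "<n>ms" timing field), not the exact production format:

```python
import re
from collections import defaultdict

# Matches e.g. '"GET /nhome/ HTTP/1.1" ... 327ms' (simplified).
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+".* (?P<ms>\d+)ms')

def aggregate_timings(lines):
    """Mean request duration (ms) per path from custom access log lines."""
    totals = defaultdict(lambda: [0, 0])   # path -> [count, total_ms]
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            entry = totals[m.group("path")]
            entry[0] += 1
            entry[1] += int(m.group("ms"))
    return {path: total / count for path, (count, total) in totals.items()}
```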
  • If someone adds or removes a host from the deployment system's topology, our config generators will pick it up. We even have some of these configs ready to be headless, so changes will be automatically propagated. Salt is an open-source remote execution framework written in Python. Since we can write Python modules to do whatever we want, we're able to create the pre/post hooks necessary for rolling out changes: take the host out of rotation, bleed traffic, confirm it's out of rotation, upgrade packages, install configs, restart trafficserver, verify the process is running, review log files, go back into rotation. Before, these steps were all done by a human, which ultimately led to mistakes. We now automate these tasks and iterate on them every time we learn how to better the process. (quick plug on inFormed)
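The rollout sequence in the notes can be sketched as an ordered pipeline that aborts on the first failed step, so a bad host never goes back into rotation; the step names follow the notes, but the callables themselves are hypothetical placeholders, not LinkedIn's actual Salt modules:

```python
def roll_host(host, steps):
    """Run (name, callable) steps in order; each callable takes the
    host and returns True on success. Returns (completed, failed_step),
    where failed_step is None if every step succeeded."""
    completed = []
    for name, step in steps:
        if not step(host):
            return completed, name   # abort the roll on first failure
        completed.append(name)
    return completed, None

STEP_NAMES = [
    "remove_from_rotation", "bleed_traffic", "confirm_out_of_rotation",
    "upgrade_packages", "install_configs", "restart_trafficserver",
    "verify_process", "review_logs", "return_to_rotation",
]
```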
  • inFormed is our in-house report of things happening in production, fed through multiple bridges: the JIRA ticketing system, IRC, deployments, whatever
  • available in the experimental section of plugins
  • This is an example
  • Google DWR was updated in the last couple of weeks and caused Chrome to bomb out on one of our JavaScript files. Within 20 minutes, we had a temporary fix deployed into production to issue the correct Content-Type header.
  • Why would you need Boom?
  • Enabled anyone to debug production issues against a single host instead of scanning for your request across 50+ servers. I can pin my requests to a specific L1 Proxy host, through a specific Fizzy host and then to a specific Profile host. Hell yes!
  • This is even more awesome due to ATS' non-blocking I/O, which avoids burning up threads on the frontends. Who has looked at LinkedIn's "View as Source"?
  • Saving 10% per request at the expense of idle CPU is a huge win!
  • Being tested in our staging environment as we speak. Potential CDN savings still to be calculated.
  • traffic_manager was unable to communicate with traffic_server because of a hard-coded file descriptor limit of 32 for the internal healthcheck, so traffic_server would restart every ~2 minutes and you'd see an uptime graph like this...
  • When ATS hit the connection_throttle limit, it would never get out of the throttle until ATS was restarted.
  • As we started adding our own stats, there was no checking in place to prevent a plugin from creating too many variables/metrics, and the {stat} endpoint was not able to return the results within the given buffer.
    https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=briang
    https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=manjeshnilange
    https://issues.apache.org/jira/issues/?jql=project%20%3D%20TS%20AND%20reporter%20in%20(manjeshnilange%2C%20briang%2C%20manjesh)%20AND%20updated%20%3E%202011-06-01
    (write out what you want to say)
  • We have rewritten a few of our plugins to use this new API; one of them is literally half the code it was before. This has enabled us to grow from 2 engineers working on ATS plugins to 6, and the ramp-up time for plugin development is dramatically reduced. Doug's comment on atscppapi: "I'd say the main feedback is that it's *really* easy to use compared to the raw API. It hides all the grunge, and just lets you focus on your logic. I wrote a transform plugin, that would probably have taken me weeks of struggling with virtual I/O buffers and so on, in just a few hours, and that included learning the basics of the API. Now that I've done it once, it would be even faster. So far I haven't hit any limitations of the abstraction. It does an excellent job of providing the functionality of ATS in a way that matches the plugin developer's mental view of the tasks to be performed, rather than the mindset of the internal ATS implementation. As long as you understand the basic concept of the ATS state machine, writing a new plugin is almost trivial."
  • Earlier this year, our Media origin for the CDN was nearing capacity. The NetApp filer's CPUs were over 50%, and if we needed to fail over, we would not have been able to serve Media requests (profile pictures, cached external content). Since we had so much success with ATS as a reverse proxy, why not try using it for its bread and butter... caching.
  • After a couple weeks of tweaking config, and $30,000 in gear later, a caching layer was built on top of our Media origin. We had a 98% cache hit rate, serving requests in < 2ms. This reduced our NetApp filer's CPU to less than 1%. The team responsible for the Media origin thought the NetApp CPU graphs were broken (we kind of forgot to mention we finished migrating the traffic over to the new cache), and we saved the company $400,000 by avoiding having to upgrade our filers. Recap... $30,000 of commodity gear + ATS saved LinkedIn $400,000.
  • Thank you for your time! Come meet the team behind ATS @ LinkedIn during our office hours at 1:15PM. We’re interested in answering any questions around our experiences of solving problems at LinkedIn with Apache Traffic Server.

Presentation Transcript

  • ©2013 LinkedIn Corporation. All Rights Reserved. Reflecting a Year After Migrating to Apache Traffic Server
  • Have You Looked At Your Access Logs Lately?
  • Surviving by Proxy
  • Even Your Registrar Breaks Sometimes
  • How Apache Traffic Server Changed LinkedIn
  • Hello!
  • ATS: Apache Traffic Server
    Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
    Single-process, multi-threaded
    Asynchronous I/O
    Plugin architecture
    Written by Inktomi >10 years ago; Yahoo acquired Inktomi, found the code on a system collecting dust in a cardboard box, and open-sourced it in 2010
  • ATS: Who's using it?
  • When we started...
    4,000 QPS to www.linkedin.com
    120M members
    Citrix NetScaler used for all external load balancing (XLB)
    – Load balances requests based on path to frontends
    – SSL termination
    – Monitors health per frontend
    Features were built as Tomcat filters
    – Tomcat required, no solution for alternates
    – >70 frontend services deployed across hundreds of hosts
  • Outgrowing the existing solution
    Need to support multiple frontend frameworks
    – DoS protection
    – Authentication
    – Optimizations
    Complete control over features
    – Cookie manipulation
    – Advanced routing
    Deployment delays: security-related fixes took days if not weeks
    Even small changes required touching network gear
  • How about an intelligent HTTP proxy layer?
    Less (re)implementing features into multiple frameworks
    Make decisions higher in the stack
    – Faster response time
    – Reduce work on the application stack
    Rapid iteration
  • Where to start?
    Evaluated options. Requirements:
    – Mature
    – Scalable
    – Language we like
    – Plugin support with hooks and documentation, shared libraries a big plus
    – Shared runtime information between plugins
    – In-house knowledge is a plus
    Apache Traffic Server matched our needs
  • Preparation
    4 patches out of the gate
    Audit traffic, build configs
    Build metrics, dashboards and alerts
    – Huge blocker, new territory for non-Java @ LinkedIn
    Migrate traffic, one service at a time
  • Let's migrate! Started migration in October 2011
  • Let's migrate! Started migration in October 2011. "We'll be done by Christmas!" - everyone
  • Original Plan: XLB -> L1 Proxy (ATS) -> VIP -> Frontend
  • Month 1: Public Profile
  • Request Rules Remap
    Cookie-based routing, e.g. logged-in vs. logged-out
    www.linkedin.com -> XLB -> L1 Proxy (ATS) -> VIP -> Frontend, in both Los Angeles and Chicago
  • Request Rules Remap
    if (request_cookie["foo"] starts_with "bar")
        return "host:chicago.linkedin.com:8888";
    else
        return "host:losangeles.linkedin.com:8888";
  • Month 3: Sentinel (DoS protection)
    Prevent abusive requests from reaching the frontend
    XLB -> L1 Proxy (ATS) -> VIP -> Frontend
  • Month 4: Picking up momentum
    Largest frontends of the site done
    – Homepage
    – Profile
    – Registration
    New ATS tier, Fizzy!
  • New ATS tier, Fizzy!
    Edge Side Includes on steroids
    UI content aggregator
    Progressive Rendering
    – Browser deferred rendering
    – Browser deferred fetch
    – Server
    Supports Server Side Rendering of JavaScript templates via V8
  • 1342
  • Now with Fizzy!
    XLB -> L1 Proxy (ATS) -> VIP -> Fizzy (ATS) -> VIP -> Frontend (plus a VIP -> Frontend (non-fizzy) path)
  • Month 6: Most frontends migrated
    Config generators written
    Caught the attention of other teams
    – New plugins developed
    Another new tier, QD Proxy!
  • Another new tier, QD Proxy!
    Quick Deploy Proxy
    – Define profiles for dev instances to route to
    – Allows multiple users to use the same profile
    – Develop without running the entire stack
  • Quick Deploy Proxy: Frontend (diagram: XLB, L1 Proxy (ATS), QD Proxy (ATS), My Frontend, Fizzy (ATS), Backend)
  • Quick Deploy Proxy: Backend (diagram: XLB, L1 Proxy (ATS), Frontend, Fizzy (ATS), QD Proxy (ATS), My Backend)
  • Month 9: Ramping Fizzy to 100%
  • Month 9: Ramping Fizzy to 100%. Broke the site
  • Month 9: Ramping Fizzy to 100%. Broke the site. HA Proxy saves the day
    – "The Reliable, High Performance TCP/HTTP Load Balancer"
    – leverage the metadata in Range to generate configs
    – reduce network hops by avoiding the hardware load balancer
    – deploy changes in minutes
  • ... and HA Proxy! (diagram: XLB -> L1 Proxy (HAProxy + ATS) -> Fizzy (HAProxy + ATS) -> Frontend, plus Frontend (non-fizzy))
  • After all that...
    October 2011: 4,000 QPS, 120M members
    August 2012: 15,000 QPS, 175M members
    Now: 67,000 QPS, 225M members
    Citrix NetScaler still in use
    – Load balancing L1 Proxy
    – SSL termination
    Features built as ATS plugins
    – Supports anything behind ATS tiers (L1 Proxy, Fizzy)
    – Quick to deploy
  • Implementation: October 2011 - August 2012 (10 months)
  • Implementation: October 2011 - August 2012
    Unexpected surprises, aka outages
    Scope creep
    – New tiers and architecture: Fizzy, HA Proxy
    – Lots of new plugins
    It takes time to build...
    – monitoring
    – tooling
    – configuration automation
  • Outages
    – Hand-edited configs with typos
    – Misbehaving node in rotation
    – Bad upgrade from 2.x to 3.x due to incompatible hostdb
    – Missing slash for a config, sent requests to the wrong frontend
    – Bonus slash to a healthcheck taking all hosts down
    – SysOps re-imaged experimental hosts, broke 10% of Profile
    – Saturated load balancer due to additional ATS layer
    – Sticky cookie conflict between frontends
    – HA Proxy wasn't started
    – Random ATS crashes
    – Coal in our stocking for Christmas
    – Multiple issues with multiple plugins
    – Log4cpp hard-coded to DEBUG at root level for one plugin, overwrote for all plugins
    – FD per-user limit unexpectedly changed
    – Keep-alive unexpectedly turned on with high timeouts
  • Outages (>0.1% of requests affected): chart of outage causes (Plugin, ATS, Human) for 2011, 2012, 2013
  • How did we improve?
  • How did we improve? Monitoring!
  • Monitoring: traffic_logstats
    – per-origin breakdown: status, method, QPS, bytes, etc.
    – Want JSON output? Use -j
    – results are COUNTER, and GAUGE if the key ends in _pct
  • Monitoring: traffic_logstats
    HTTP return codes          Count      Percent  Bytes    Percent
    ----------------------------------------------------------------
    100 Continue               0          0.00%    0.00KB   0.00%
    200 OK                     1,383,361  93.57%   4.71GB   97.48%
    201 Created                5,429      0.37%    3.28MB   0.07%
    202 Accepted               0          0.00%    0.00KB   0.00%
    203 Non-Authoritative Info 0          0.00%    0.00KB   0.00%
    204 No content             12         0.00%    5.63KB   0.00%
    205 Reset Content          0          0.00%    0.00KB   0.00%
    206 Partial content        0          0.00%    0.00KB   0.00%
    2xx Total                  1,388,802  93.94%   4.71GB   97.54%
    300 Multiple Choices       0          0.00%    0.00KB   0.00%
    301 Moved permanently      3,360      0.23%    3.47MB   0.07%
    302 Found                  38,475     2.60%    35.09MB  0.71%
    303 See Other              11         0.00%    3.87KB   0.00%
    304 Not modified           29,262     1.98%    12.20MB  0.25%
    305 Use Proxy              0          0.00%    0.00KB   0.00%
    307 Temporary Redirect     0          0.00%    0.00KB   0.00%
    3xx Total                  71,108     4.81%    50.76MB  1.03%
    ...
  • Monitoring: traffic_line
    – Swiss army knife for Traffic Server
    – executable to read variables
  • Monitoring: {stat}
    – prefer HTTP over shell?
    records.config:
      CONFIG proxy.config.http_ui_enabled INT 2
    remap.config:
      map /_stat/ http://{stat} @action=allow @src_ip=127.0.0.1
  • Monitoring: {stat}
    proxy.node.restarts.manager.start_time
    proxy.node.restarts.proxy.start_time
  • Monitoring: {stat}
    proxy.node.current_client_connections
    proxy.node.current_server_connections
  • Monitoring: {stat}
    proxy.config.net.connections_throttle: limit before ATS starts to drop connections, based on the sum of client and server connections
    proxy.process.net.connections_currently_open: client + server connections
  • Monitoring: {stat}
    Plugin-specific stats, reviewed before plugins go to production
    Examples
    – enforced vs. un-enforced DoS requests
    – track cookie usage for a migration
    – thread usage of a plugin
  • Monitoring: outside the app
    Core dump rate
    – generate crash reports with full stack trace
    – monitor the file system for core dumps newer than 24 hours
    – alert if > N
    TCP
    – capture states from netstat
    – listen queue overflowing (net.core.somaxconn)
    Proc
    – review /proc/pid/status
    – fetch VmSize and VmSwap
    – count # of files in /proc/pid/fd for FD usage
  • Monitoring: logs
    I HATE dislike the stock logs
    squid.log
    – mimics the squid access log
    – more useful if you're caching
    common.log, extended.log, extended2.log
    – Netscape formats
    – not enough detail
    custom logging!
  • Custom Logging
    records.config:
      CONFIG proxy.config.log.custom_logs_enabled INT 1
    logs_xml.config:
      <LogFormat>
        <Name = "custom_access"/>
        <Format = "%<chi> %<{X-Real-Client-IP}cqh> - %<caun> [%<cqtn>] "%<cqhm> %<cquuc> %<cqhv>" %<pssc> %<pscl> "%<{Referer}cqh>" "%<{User-Agent}cqh>" %<ttms>ms %<cquc> %<{X-LI-UUID}psh>"/>
      </LogFormat>
      <LogObject>
        <Format = "custom_access"/>
        <Filename = "access"/>
      </LogObject>
  • Custom logging (example values)
    %<chi>                      172.16.200.10
    %<{X-Real-Client-IP}cqh>    65.16.225.8
    %<caun>                     - (http auth'd username)
    [%<cqtn>]                   [01/Nov/2011:23:59:59 +0000]
    "%<cqhm> %<cquuc> %<cqhv>"  "GET /nhome/ HTTP/1.1"
    %<pssc>                     200
    %<pscl>                     34697
    %<{Referer}cqh>             "http://www.linkedin.com/"
    %<{User-Agent}cqh>          "Mozilla/4.0 (compatible; ...)"
    %<ttms>                     327ms
    %<cqu>                      http://origin:port/nhome/
  • Dashboard: overview
    Internal ATS: client connections, server connections, traffic_cop uptime, traffic_server uptime, connection failed, invalid request
    Logs: 2xx status, 3xx status, 4xx status, 5xx status, HTTP methods
    OS: CPU usage, interface, TCP state distribution, # of core dumps, ATS memory usage, ATS swap usage, ATS file descriptor usage
  • Dashboard: in-depth
    – plugin-specific
    – per-path histogram of request durations
    – per-origin HTTP status breakdown
    HA Proxy
    – current sessions
    – denied requests
    – error requests
    – server status
  • How did we improve? Automation!
    Configs are generated, not hand-maintained
    – Details about a service are stored in the metadata store
    – YAML configs supplement missing data
    Deployment done by Salt
    – All deployment actions and verifications are integrated with inFormed
  • inFormed
  • Plugins! header-rewrite, request-rules-remap, sentinel, lix-remap, host_override, postbuffer, mobileredirect, correctcookiedomain, qdproxy, boom, pagespeed, contentsecurityheader, authfilter, oauth-rewrite, stickyrouting
  • Plugins: header-rewrite
    Manipulate headers at any point in the request lifecycle
    – read request
    – send request
    – read response
    – send response
    Can be used as a remap plugin
    – change path, destination, port
    Patched to include variables
  • Plugins: header-rewrite
    cond %{READ_REQUEST_HDR_HOOK} [AND]
    cond %{ACCESS:/var/healthcheck} [NOT]
    rm-header Connection
    add-header Connection "close"
  • Plugins: header-rewrite
    cond %{SEND_RESPONSE_HDR_HOOK} [AND]
    cond %{PATH} "/foo.js"
    add-header Content-Type "text/javascript"
  • Plugins: lix-remap
    Uses the LinkedIn Experiments infrastructure (A/B testing) to make routing decisions
    Enables the NOC to easily send traffic to another data center
    Route specific users, LinkedIn employees, or a % of users to experimental tiers
    Used for red-line performance testing of frontends
  • Plugins: Boom. We don't want to show users this...
  • Plugins: Boom. ... but based on status code, we can replace it with this:
  • Plugins: Host Override. Direct your request to a specific host through any ATS tier
  • Plugins: PageSpeed. Support on-the-fly operations before sending the response
  • Plugins: PageSpeed. HTML minification – How many empty new lines are on Profile?
  • Plugins: PageSpeed. HTML minification – How many empty new lines are on Profile? 2703
  • Plugins: PageSpeed. HTML minification – How many empty new lines are on Profile? 2703 – How many empty new lines are on Homepage?
  • Plugins: PageSpeed. HTML minification – How many empty new lines are on Profile? 2703 – How many empty new lines are on Homepage? 9205
  • Plugins: PageSpeed. HTML minification – Homepage: 78%, Profile: 72%
  • Plugins: PageSpeed. HTML minification – Homepage: 10%, Profile: 17% (chart: compressed bytes, Homepage vs. Profile)
  • Plugins: PageSpeed. Lazy loading of images below the fold
  • The awesome patches
  • The awesome patches: traffic_server gets restarted if FD > 32
  • The awesome patches: traffic_server gets restarted if FD > 32; infinite emergency throttle
  • The awesome patches: traffic_server gets restarted if FD > 32; infinite emergency throttle; buffer overflow in the stats system
  • Contributions back
    28 fixes committed back to open source
    19 more pending
    LinkedIn ATS committer: Brian Geffon
  • ATS C++ API
    Simplifies the process of writing ATS plugins
    https://github.com/linkedin/atscppapi
    "I wrote a transformation plugin that would probably have taken me weeks, struggling with virtual I/O buffers, in just a few hours. Now that I've done it once, it would be even faster."
    - Doug Young, Sr. Staff Software Engineer
  • Almost forgot... Media Cache!
    Serves profile pictures, cached external content
    Pre-ATS
    – NetApp filer CPU >50%
    – Expected an outage during NetApp failover
  • Almost forgot... Media Cache!
    Serves profile pictures, cached external content
    Pre-ATS
    – NetApp filer CPU >50%
    – Expected an outage during NetApp failover
    Post-ATS
    – 98% cache hit rate
    – $30,000 in gear, saved $400,000
    – Bought us time to re-architect the service
  • So what are the takeaways?
    ATS is a bad ass HTTP proxy
    Small details matter, fight for the users
    HA Proxy is a silver bullet
    Slow down, learn from your mistakes
    Don't just use open-source, contribute
  • Meet the team: Manjesh Nilange, Brian Geffon, Thomas Jackson, Nick Berry
    Office hours @ 1:15 PM, Exhibit Hall (Table 2)
  • Links
    This talk:
    Apache Traffic Server: http://trafficserver.apache.org
    ATS C++ API: https://github.com/linkedin/atscppapi
    New plugins: https://github.com/linkedin/ -- coming soon!
  • Goodbye!