Project Skyfall
Matt Abrams (@abramsm)

An overview of Project Skyfall, a globally distributed, fault-tolerant event consumption framework used by AddThis.com to consume billions of events per day.

  1. Project Skyfall, Matt Abrams (@abramsm)
  2. Agenda
      • A bit about AddThis
      • Why did we need Skyfall?
      • Architecture
      • Operations/Performance
  3. Introduction
  4. Fun with Numbers
      • AddThis JavaScript loads more than 3 billion times per day
      • Edge network (Skyfall) receives around 4 billion hits per day
      • Either data center can handle 100% of the load (we test this often)
      • Currently using around 1,000 servers (will double next year)
  5. Data Center Porn
  6. Why did we need Skyfall?
      • We couldn’t find anyone else to do it for us
          • Previous vendor’s log aggregation was delayed by a minimum of 3 hours and could take up to 5 days
      • Minimize impact on our publishers
          • Combining log collection with remote services means we need only 1 event instead of n
      • Support near-real-time applications
  7. Why did we call it Skyfall?
  8. Why did we call it Skyfall?
  9. Skyfall Goals and Architecture
  10. Skyfall Goals (Technical)
      • High availability
      • Low latency
      • Use for internal and external logging needs
      • O(1) reads and writes
      • Smart clients
      • Handle server and DC failure gracefully
      • Zero-downtime deployment and configuration
      • In-session RPC
      • Support data filtering at the edge
  11. Why speed and robustness matter
  12. Architecture [diagram; labels: Web Event, Global Traffic Management, DC1, DC2, Skyfall, Repeater, Consumer, Service]
  13. 1. Messages are placed on a concurrent non-blocking queue (CNBQ) to minimize latency impact on the producer
      2. Messages are then popped from the CNBQ and placed on a disk-backed queue (DBQ)
      3. The DBQ provides temporary storage in case Kafka is down or backed up
      4. Messages are popped from the DBQ and sent to Kafka, where they are persisted to the file system
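The four steps on slide 13 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Skyfall's implementation: `DiskBackedQueue`, `drain_memory_to_disk`, and `drain_disk_to_sink` are hypothetical names, the JSON-lines file format is invented for the example, a `collections.deque` stands in for the CNBQ, and the `send` callable stands in for a Kafka producer.

```python
import json
import os
import tempfile
from collections import deque


class DiskBackedQueue:
    """Toy disk-backed FIFO: one JSON message per line in an append-only file."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte offset of the next unsent message
        open(self.path, "ab").close()  # make sure the file exists

    def put(self, message):
        with open(self.path, "ab") as f:
            f.write((json.dumps(message) + "\n").encode("utf-8"))

    def peek(self):
        """Return (message, size_in_bytes) for the next message, or None."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            line = f.readline()
        if not line:
            return None
        return json.loads(line.decode("utf-8")), len(line)

    def commit(self, size):
        """Advance past a message once the sink has accepted it."""
        self.offset += size


def drain_memory_to_disk(memory_queue, dbq):
    """Steps 1-2: move messages from the in-memory queue to disk."""
    moved = 0
    while memory_queue:  # assumes a single draining thread
        dbq.put(memory_queue.popleft())
        moved += 1
    return moved


def drain_disk_to_sink(dbq, send):
    """Steps 3-4: forward messages; keep them on disk if the sink is down."""
    sent = 0
    while True:
        entry = dbq.peek()
        if entry is None:
            return sent
        message, size = entry
        try:
            send(message)  # stand-in for a Kafka producer call
        except ConnectionError:
            return sent  # message stays on disk for a later retry
        dbq.commit(size)
        sent += 1
```

The producer only ever touches the in-memory queue, so a slow or unavailable sink shows up as a growing file rather than as latency on the request path.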
  14. Kafka
      • Kafka treats persistence as a first-class citizen
      • Focus is on high throughput rather than lots of bells and whistles
      • State about what has been consumed is maintained in the client rather than the server
      • Kafka is explicitly distributed
      • Supports O(1) reads and writes
      • Pull rather than push
      • http://incubator.apache.org/kafka/design.html
  15. Circuit Breaker for Remote Services
      • The pattern is used to detect failures and encapsulates the logic of preventing a failure from recurring constantly [1]
      • If a service instance throws an error, times out, or responds with a failure message, an error event is marked
      • If the error rate threshold is exceeded, that service instance is removed from the pool of available services
      • Before re-adding a service to the pool, a test request is made and validated
      • Internal service failures should not be reflected in the response to the message originator
      • [1] http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
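The circuit-breaker behavior on slide 15 might look something like the sketch below. The `ServicePool` class, its method names, the sliding-window size, and the threshold value are all assumptions made for illustration; the slides do not say how Skyfall measures its error rate.

```python
from collections import deque


class ServicePool:
    """Toy circuit breaker over a pool of named service callables."""

    def __init__(self, services, window=10, max_error_rate=0.5):
        self.services = dict(services)   # name -> callable(request)
        self.available = list(self.services)
        self.tripped = []
        self.window = window             # hypothetical: last N calls per instance
        self.max_error_rate = max_error_rate
        self.outcomes = {name: deque(maxlen=window) for name in self.services}

    def _record(self, name, ok):
        history = self.outcomes[name]
        history.append(ok)
        error_rate = history.count(False) / len(history)
        window_full = len(history) == self.window
        if name in self.available and window_full and error_rate > self.max_error_rate:
            # Threshold exceeded: remove the instance from the pool.
            self.available.remove(name)
            self.tripped.append(name)

    def call(self, name, request, default=None):
        """Call a service; failures are recorded and hidden from the caller."""
        if name not in self.available:
            return default
        try:
            result = self.services[name](request)
        except Exception:
            self._record(name, False)  # error or timeout marks an error event
            return default
        self._record(name, True)
        return result

    def retry_tripped(self, test_request, validate):
        """Send a test request to each tripped instance; re-add it if valid."""
        for name in list(self.tripped):
            try:
                healthy = validate(self.services[name](test_request))
            except Exception:
                healthy = False
            if healthy:
                self.outcomes[name].clear()
                self.tripped.remove(name)
                self.available.append(name)
```

Returning `default` instead of raising mirrors the last bullet: the message originator never sees an internal service failure.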
  16-23. What does a call to our endpoint look like? (the same request, annotated progressively)
      • "GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
      • Callouts on the slide: Topic, Version, Resource, URL Parameters, Status Code, Bytes Transferred, CDN Resource, User Agent
      • The endpoint also receives header and cookie information not shown here.
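To make the callouts concrete, here is a small sketch that splits the sample request into the labeled fields. The mapping of `live` to Topic and `t00` to Version is inferred from the order of the callouts and is an assumption, as are the `LINE` pattern and the `parse_hit` helper; note that the sample uses `&` rather than `?` before the parameters.

```python
import re
from urllib.parse import parse_qsl

# Hypothetical pattern for the log line shown on the slide.
LINE = re.compile(
    r'"GET /(?P<topic>[^/]+)/(?P<version>[^/]+)/(?P<resource>[^&" ]+)'
    r'(?:&(?P<params>[^" ]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<cdn_resource>[^"]*)" "(?P<user_agent>[^"]*)"'
)

SAMPLE = (
    '"GET /live/t00/250lo.gif&foo=bar" 200 37 '
    '"http://s7.addthis.com/static/r07/sh103.html" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"'
)


def parse_hit(line):
    """Return a dict of the labeled fields, or None if the line does not match."""
    match = LINE.search(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["status"] = int(hit["status"])
    hit["bytes"] = int(hit["bytes"])
    # Turn "foo=bar&x=1" into {"foo": "bar", "x": "1"}.
    hit["params"] = dict(parse_qsl(hit["params"] or "", keep_blank_values=True))
    return hit


if __name__ == "__main__":
    print(parse_hit(SAMPLE))
```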
  24. Zero Downtime Deployment and Configuration [diagram: Group 1 and Group 2, each labeled 4, 8, 16 over servers S1 to S5]
  25. Endpoint Configuration
      • Each endpoint maps to a ‘topic’
      • Header elements may be extracted from the HTTP request
      • Parameters may be mapped to new key names
      • Variables may be extracted from the URL path
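One way to picture the four configuration options on slide 25 is the sketch below. The `ENDPOINT` dictionary layout, its key names, and the `build_event` helper are hypothetical; the slides do not show Skyfall's actual configuration format.

```python
import re

# Hypothetical endpoint definition covering the four bullets above.
ENDPOINT = {
    "topic": "live",                                   # endpoint -> topic
    "path_pattern": r"^/live/(?P<version>[^/]+)/(?P<resource>[^/]+)$",
    "headers": ["User-Agent", "Referer"],              # headers to extract
    "param_names": {"foo": "campaign"},                # parameter renames
}


def build_event(endpoint, path, params, headers):
    """Turn one HTTP request into an event dict, or None if the path differs."""
    match = re.match(endpoint["path_pattern"], path)
    if match is None:
        return None
    event = {"topic": endpoint["topic"]}
    event.update(match.groupdict())          # variables from the URL path
    for name in endpoint["headers"]:
        if name in headers:
            event[name] = headers[name]      # selected header elements
    for key, value in params.items():
        # Map the parameter to its new key name, if one is configured.
        event[endpoint["param_names"].get(key, key)] = value
    return event
```

For example, `build_event(ENDPOINT, "/live/t00/250lo.gif", {"foo": "bar"}, {"User-Agent": "x"})` yields an event with `topic`, `version`, `resource`, `User-Agent`, and `campaign` keys.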
  26. Data Center Repeater
      • DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center
      • If a peer node becomes unreachable, the local node will select a new peer
      • These are special consumers of the Kafka log data created by the local node
      • [diagram labels: N1, N2, N3]
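The failover rule on slide 26 can be illustrated with a very small sketch. The slides do not describe how peers are negotiated, so the "first reachable node" policy, the `select_peer` function, and the `is_reachable` callback are all assumptions.

```python
def select_peer(remote_nodes, is_reachable, current=None):
    """Keep the current peer while it is reachable; otherwise pick a new one.

    remote_nodes: ordered list of node names in the other data center.
    is_reachable: callable(node) -> bool, e.g. a health check (hypothetical).
    Returns the chosen peer, or None if no remote node is reachable.
    """
    if current is not None and is_reachable(current):
        return current
    for node in remote_nodes:
        if node != current and is_reachable(node):
            return node
    return None
```

With remote nodes `["N1", "N2", "N3"]` and `N1` unreachable, a local node currently peered with `N1` would move to `N2`.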
  27. Skyfall Operations
  28. Requests per second (VA Data Center)
  29. TCP: When do you say goodbye? http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg
  30. Connection Tracking: what you need to know
      • Connection information is maintained in memory
      • The message “ip_conntrack: table full, dropping packet” is BAD
      • Chrome doesn’t close the connection on FIN
      • This means that the connection info remains open until it times out, drastically increasing the number of connections your server needs to track
      • You need some mechanism for timing out the connection in a reasonable time period
  31. HAProxy
      • We use a simple round-robin load-balancing algorithm with a liveness check
      • Default connection timeouts are way too high. Reasonable values are used to prevent excessive connection tracking
      • “http-close” and “http-server-close” are enabled to ensure low latency for clients and fast session reuse for the server
      • HAProxy is our solution of choice for our LB needs. We prefer software solutions on commodity hardware over expensive custom LB appliances
      • They could use a new logo
