
Couchbase at DIRECTV: Using Couchbase for Large Scale Microservices – Connect Silicon Valley 2018


Speaker: Fidencio Garrido, Principal Engineer, DIRECTV

Running a microservices environment at enterprise scale is quite challenging – you’re deploying hundreds of services that need to perform well, and to succeed you need to maximize the efficiency of your infrastructure as soon as possible. At AT&T Entertainment Group, principal engineer Fidencio Garrido and team learned that to keep innovating they needed a set of common practices to help them stay focused on building the next big thing instead of constantly firefighting. In this session, Garrido will share some of the practices that his team developed to fine-tune its development stack, including Node.js, N1QL, efficient I/O, and more.

Published in: Technology


  1. Node + Couchbase best practices
  2. About me: Principal Engineer at AT&T. 21 years of experience coding in Pascal, Java, PHP, C#, JavaScript, Kotlin, Python, and others. Now learning Go and Rust. Big fan of (American) football. @elfidomx
  3. AT&T worldwide presence. Areas: - Mobility - Entertainment
  4. Have you seen us? 📱 iOS/Android phones and tablets 📺 AppleTV, satellite, Roku, Chromecast, and smart TVs 🖥 Web browsers ✈️ Airplanes
  5. Our platform: hundreds of services running in Kubernetes. A mixed environment of Java, Node, TypeScript, BrightScript, Scala, Kotlin, Groovy, Swift, Objective-C, Go, C, Python, etc. Communication via REST, Kafka, MQTT, and NATS.
  6. What is a second?
  7. 1 second of DirecTV Now video
     Different: video formats; transports (satellite and Internet); DRMs; business rules and features per network, channel, device, and content
     Behind the scenes:
     - 600,000 operations in Couchbase
     - Tons of gigabytes processed
     - About 3 GB of metrics captured
     - 300+ channels encoded in real time
     - At least 6 CI/CD builds
     - Hundreds of new lines of code
     - 600k authentication requests
  8. Disclaimer: there might always be something faster.
     _start:
         ; get sum of num1 and num2
         mov rax, num1
         mov rbx, num2
         add rax, rbx
         ; compare rax with the correct sum, 150
         cmp rax, 150
         ; if rax is not 150, go to exit
         jne .exit
         ; if rax is 150, print msg
         jmp .rightSum
     ; print message that the sum is correct
     .rightSum:
         ;; write syscall
         mov rax, 1
         ;; file descriptor: standard output
         mov rdi, 1
         ;; message address
         mov rsi, msg
         ;; length of message
         mov rdx, 15
         ;; invoke the write syscall
         syscall
         ; exit from program
         jmp .exit
     ; exit procedure
     .exit:
         mov rax, 60
         mov rdi, 0
         syscall
  9. (Then) Why Node?
     - Fast I/O
     - Fast boot time
     - Great at optimizing itself
     - C++ and Rust add-ons
     - Tooling
     - Productivity
     - JS is the new lingua franca: desktop, bots, browsers, serverless, backend, IoT, and recently even ML
  10. How is Couchbase used?
      - 200 million documents across multiple clusters; 33 million of them live in a single bucket
      - Mixed usage of cache only and cache + N1QL
      - Three platforms: Java, Go, and Node.js
      - 400,000+ ops/s across all clusters; 200k of them come from a single bucket
      - Up to 6,000 queries per second in N1QL
  11. Our story begins with a sad day... 10 minutes at 95+%, traffic dropping from 800 qps to 600 qps, response time 3 secs.
  12. How can I find the root cause?
  13. Warm-up tips
  14. Advice #1: list at least all of the data servers in your connection string; that allows the service to boot even if one of the nodes is down. Also, don't forget to set your timeout.
      const couchbase = require('couchbase');
      const conn_str = 'couchbase://mycb1,mycb2,mycb3,mycb4';
      const cluster = new couchbase.Cluster(conn_str);
      cluster.authenticate('user', 'password');
      const bucket = cluster.openBucket('mybucket', (err) => {
        if (err) { throw err; }
        console.log('connected to couchbase');
      });
      bucket.operationTimeout = 5000;
  15. Advice #2: keep track of resource availability and only allow incoming traffic once the resources are available.
      Service definition:
      const Events = require('events');
      const couchbase = require('couchbase');

      class Service extends Events {
        constructor() {
          super();
          const cluster = new couchbase.Cluster('couchbase://cb1,cb2');
          cluster.authenticate('user', 'pass');
          this.bucket = cluster.openBucket('mybucket', (err) => {
            if (err) {
              this.emit('couchbase_error', err);
            } else {
              this.emit('couchbase_ready');
            }
          });
        }
      }

      REST definition:
      let serviceAvailable = false;
      const service = new Service();
      service.on('couchbase_error', (err) => {
        console.error(err);
        serviceAvailable = false;
      });
      service.on('couchbase_ready', () => {
        serviceAvailable = true;
      });

      app.get('/health', (req, res) => {
        const statusCode = (serviceAvailable === true) ? 200 : 500;
        res.status(statusCode).send();
      });
  16. Advice #3: do not abuse document size just because it is NoSQL; there are other associated costs that can affect you: network I/O, serialization, compression.
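      The costs above can be made concrete with a quick measurement. This sketch (not from the talk; the document shape is invented for illustration) shows how the bytes that must be serialized, shipped, and optionally compressed grow with a "fat" document:

      ```javascript
      // Sketch: estimating the hidden costs of an oversized document.
      // The document shape is hypothetical; only Node's stdlib is used.
      const zlib = require('zlib');

      // A "fat" document: programme metadata duplicated for every airing.
      const fatDoc = {
        channelId: 'CHAN_424',
        airings: Array.from({ length: 500 }, (_, i) => ({
          start: 1509548757903 + i * 3600000,
          end: 1509548757903 + (i + 1) * 3600000,
          title: 'Show #' + i,
          description: 'A long synopsis that travels over the wire on every read. '.repeat(5),
        })),
      };

      // Serialization cost: every read/write pays for the full JSON body.
      const serialized = JSON.stringify(fatDoc);
      // Compression cost: CPU time spent shrinking the payload for the network.
      const compressed = zlib.gzipSync(serialized);

      console.log('serialized bytes:', serialized.length);
      console.log('gzipped bytes:', compressed.length);
      ```

      Reading a whole document like this to serve one airing pays all of these costs on every request; splitting it into smaller documents keyed per airing avoids that.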
  17. Advice #4: for applications with a heavy write rate, where a single document can receive concurrent writes, avoid duplicating operations if you are OK with only the first request winning.
      const inProgressWrites = new Set();

      function saveDocument(key, document) {
        if (!inProgressWrites.has(key)) {
          inProgressWrites.add(key);
          bucket.insert(key, document, (err) => {
            inProgressWrites.delete(key);
            if (err) {
              // handle error
            }
          });
        }
      }

      // Will try to save the document
      saveDocument('la_ca', { city: 'Los Angeles' });
      // Will be completely ignored
      saveDocument('la_ca', { city: 'Los Angeles' });
  18. Advice #5: no primary index in production.
      create primary index `my-primary-index` on `mybucket` using gsi; 😭
  19. Indexing
  20. It's all about data sets
      Example subsets on the same data: channelId='132'; startTime='<now>' and endTime='<now+14 days>'; startTime='<now>' and endTime='<now+2 hours>'
      Select x, y, z from <bucket> where <smaller_subset> and <second_subset> using <hinted_index>
      Query plan cycle:
      1. Evaluate query
      2. Fetch the indexed fields matching the 1st criterion in the index
      3. Fetch the documents from step 2
      4. Apply all other filters
      5. Deliver results
  21. Queries
      ● Could you use the K-V API instead?
      ● Always page your results (it might be useful to use a timestamp metadata attribute)
      ● Set a timeout
      ● When the volume of data is significant, stream the results
      ● Use the right consistency level depending on the performance/accuracy required:
        ○ Not bounded. Fastest, but it will only retrieve what has already been indexed and is available in memory
        ○ At plus. Read your own writes
        ○ Request plus. Waits for all mutations up to the moment of the query execution to be indexed
      ● Load test and profile
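      The paging advice can be sketched as a small helper that always emits a bounded statement. This is not from the talk; the function, bucket, and field names are illustrative, and the SDK calls in the comments assume the 2.x Node Couchbase SDK:

      ```javascript
      // Sketch: build a paged, parameterized N1QL statement so a client can
      // never pull an unbounded result set. All identifiers are hypothetical.
      function pagedQuery(bucketName, channelId, { limit = 100, offset = 0 } = {}) {
        return {
          // channelId travels as a $-parameter instead of string concatenation
          statement:
            'SELECT `start`, `end`, `channelId`, `title` ' +
            'FROM `' + bucketName + '` ' +
            'WHERE channelId = $channelId ' +
            'ORDER BY `start` ' +
            'LIMIT ' + limit + ' OFFSET ' + offset,
          params: { channelId },
        };
      }

      const page2 = pagedQuery('cbperf', 'CHAN_424', { limit: 50, offset: 50 });
      console.log(page2.statement);
      // With the 2.x Node SDK this would run roughly as:
      //   const q = couchbase.N1qlQuery.fromString(page2.statement)
      //     .consistency(couchbase.N1qlQuery.Consistency.NOT_BOUNDED);
      //   bucket.query(q, page2.params, callback);
      ```

      Offset-based paging gets slower on deep pages; that is why the slide suggests keyset paging on a timestamp attribute (WHERE `start` > $lastSeen) for large data sets.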
  22. Select
      Avoid "select *":
      ● It increases response size
      ● It doesn't take advantage of covering indexes
      Avoid using a reflected ID; to look up multiple IDs, use the 'use keys' N1QL operator.
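      The 'use keys' suggestion can be sketched as follows (not from the talk; the bucket and key names are invented). When the document IDs are already known, USE KEYS bypasses the index service entirely and fetches by key:

      ```javascript
      // Sketch: a USE KEYS statement for a known list of document IDs.
      // Bucket name, field names, and keys are hypothetical.
      function byKeysQuery(bucketName, keys) {
        return (
          'SELECT `start`, `end`, `channelId`, `title` ' +
          'FROM `' + bucketName + '` ' +
          // JSON.stringify produces the N1QL array literal, e.g. ["k1","k2"]
          'USE KEYS ' + JSON.stringify(keys)
        );
      }

      console.log(byKeysQuery('cbperf', ['chan_424_ev1', 'chan_424_ev2']));
      ```

      If the only projection you need is the documents themselves, the plain K-V getMulti path is cheaper still; USE KEYS is the middle ground when you want N1QL projections over known keys.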
  23. Three queries... one index
      // My index
      create index byTimeChannel on cbperf(`start`, `end`, `channelId`)

      // The simple way
      select * from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'

      // Reducing the payload also reduces network usage, serialization time, and, in general, blocking time.
      // It is actually slower with small documents or small datasets.
      select `start`, `end`, `channelId`, `title` from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'

      // Covering index
      select `start`, `end`, `channelId` from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'

      // Covering index 2: uses the covering index to then fetch the documents
      select RAW d2 from (select raw meta().id from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424') as d1 join cbperf as d2 on keys d1
  24. We just reduced the execution time by ~3x. Now we are good at using Couchbase!!!
  25. Covering index returning only required columns
      Covering index: all the required data is already part of the index; therefore, there is no need to do anything else.
      select `start`, `end`, `channelId` from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'
      create index byTimeChannel on cbperf(`start`, `end`, `channelId`)
      Query plan cycle:
      1. Evaluate query
      2. Fetch the indexed fields matching all the conditions (not just the 1st criterion)
      3. Fetch the documents from step 2 (skipped: not needed)
      4. Apply all other filters (skipped: not needed)
      5. Deliver results
  26. Covering index returning additional columns
      Covering index: a subquery returns only the IDs of the documents that match the entire index. With that result set, we execute a join that brings back only those documents, not all the ones that matched the first criterion in the index. We save the manual filtering.
      select RAW d2 from (select raw meta().id from cbperf where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424') as d1 join cbperf as d2 on keys d1
      create index byTimeChannel on cbperf(`start`, `end`, `channelId`)
      Query plan cycle:
      1. Evaluate query
      2. Fetch the indexed fields matching all the conditions (not just the 1st criterion)
      3. Fetch the documents from step 2
      4. Apply all other filters (skipped: not needed)
      5. Deliver results
  27. Can we do better?...
  28. Drop that ugly index...
      drop index cbperf.byTimeChannel
      # from: create index byTimeChannel on cbperf(`start`, `end`, `channelId`)
      Let's play with a different one:
      create index byTimeChannel on cbperf(`channelId`, `start`, `end`)
  29. And we test...
  30. Comparing best cases
      - Fully covered optimized index: 198 vs 113 ms; the covered index saved ~43% of the original time
      - Covered optimized index 2: 198 vs 189 ms; the covered index saved ~5% of the original time
  31. Why? Hint... size matters
      select count(*) as total from cbperf where `start`>1509548757903 and `start`<1509615358273
      # 9131
      select count(*) as total from cbperf where channelId = 'CHAN_424'
      # 348
      Phase 1 with index 1: find the channel in each of 9131 records.
      Phase 1 with index 2: find the start time in each of 348 records.
  32. Warning!!! Conditions can easily change on the same data set:
      // A shorter period of time
      select count(*) as total from cbperf where `start`>1509548757903 and `start`<1509548787903
      # 15
      // If instead of storing 14 days of data we now store 1 month
      select count(*) as total from cbperf where channelId = 'CHAN_424'
      # ~700
      Now we have a mix of cases where a query for a short period of time (e.g., 2 hours to render the program guide grid) can match fewer items than the total for the channel, and vice versa, so our index swaps its efficient behavior.
  33. How to decide which index to use? :'(
  34. Since you asked...
      create index byTimeChannel on cbperf(`start`, `end`, `channelId`)
      create index byChannelTime on cbperf(`channelId`, `start`, `end`)

      select `start`, `end`, `channelId`, title from cbperf use index (byChannelTime using gsi) where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'

      select `start`, `end`, `channelId`, title from cbperf use index (byTimeChannel using gsi) where `start`>1509548757903 and `start`<1509615358273 and channelId = 'CHAN_424'

      You are now a Couchbase ninja... who has to add code to pick the best index at runtime.
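      The "pick the best index at runtime" idea can be sketched like this (not from the talk; the counts are hard-coded from the slides, and in practice they would come from periodic count(*) probes or cached statistics):

      ```javascript
      // Sketch: choose the index hint whose leading condition narrows phase 1
      // of the query plan the most. Index names match the slides; the
      // selectivity estimates are hard-coded for illustration.
      const estimates = {
        timeWindow: 9131, // docs matching the start/end window
        channel: 348,     // docs matching channelId = 'CHAN_424'
      };

      function chooseIndex(est) {
        // Prefer the index whose leading key is the more selective condition.
        return est.channel <= est.timeWindow ? 'byChannelTime' : 'byTimeChannel';
      }

      const hint = chooseIndex(estimates);
      const query =
        'SELECT `start`, `end`, `channelId`, title FROM cbperf ' +
        'USE INDEX (' + hint + ' USING GSI) ' +
        "WHERE `start`>1509548757903 AND `start`<1509615358273 AND channelId = 'CHAN_424'";
      console.log(hint); // → byChannelTime
      ```

      As slide 32 warns, the estimates must be refreshed as data retention changes, or the chooser will keep hinting yesterday's best index.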
  35. Profiling: Couchbase 4.x vs Couchbase 5
  36. Key elements to look for
      😭 A PrimaryScan in the plan:
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "Authorize",
            "#stats": {
              "#phaseSwitches": 4,
              "execTime": "2.645µs",
              "kernTime": "22.562863ms",
              "servTime": "1.46282ms"
            },
            "privileges": {
            "~children": [
              {
                "#operator": "PrimaryScan",
                "#stats": {
      😀 An IndexScan2 covering the query:
      {
        "plan": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "IndexScan2",
                  "covers": [
                    "cover (((`guide`.`summary`).`channelId`))",
                    "cover (((`guide`.`summary`).`start`))",
                    "cover (((`guide`.`summary`).`end`))",
                    "cover ((meta(`guide`).`id`))"
                  ],
                  "index": "by_time_channel_2",
  37. It all starts with a theory: different algorithms and designs are expected to behave differently.
  38. Couchbase Studio
