Couchbase_UK_2013_Couchbase_in_Production_II
  • In this session, we're shifting gears from development to production. I'm going to talk about how to operate Couchbase in production: how to care for and feed the system to maintain application uptime and performance. I will demo as much as time permits, since much of this is about practice. This presentation discusses the new features and production impact of 2.0; most of it also applies to 1.8, and I will call out the specific differences as we come to them.
  • Calculate for both active data and the number of replicas. Replicas will be the first to be dropped out of RAM if there is not enough memory.
  • Each one of these factors can determine the number of nodes: data sets, workload, etc.
  • Different applications, and even where an application is in its lifecycle, will require different ratios between data in RAM and data only on disk (i.e. the working-set-to-total-set ratio varies by application). We have three examples of very different working set to total data set size ratios.
  • This is not unique to Couchbase; MySQL suffers as well, for example.
  • Replication is needed only for writes/updates; gets are not replicated.
  • The chart shows average latency (response times) across varying document sizes (1KB to 16KB). It demonstrates that Couchbase Server is extremely fast: latency is under 100 μsec on a 10GbE network for documents of all sizes. Network latency has an impact on a 1GbE network, but latency is flat and consistent on 10GbE. Couchbase Server gives you consistent, predictable latency at any document size.
  • The more nodes you have, the less impact a failure of one node has on the remaining nodes in the cluster. One node is a single point of failure, which is obviously bad. Two nodes give you replication, which is better, but if one node goes down the whole load goes to just one node and you are back at a single point of failure. Three nodes is the minimum recommendation because a failure of one distributes the load over two. The more nodes the better, as recovering from a single node failure is easier with more nodes in the cluster.
  • Add CPU
  • Stats and stats timing. Demo 3: stats; check how we are on time. Load: -h localhost -i1000000 -M1000 -m9900 -t8 -K sharon -c500000 -l
  • Do not failover a healthy node!
  • Transcript

    1. Couchbase Server 2.0 in Production, Part II. Perry Krug, Sr. Solutions Architect
    2. Node and cluster sizing
    3. Size Couchbase Server. Sizing == performance:
       • Serve reads out of RAM
       • Enough I/O for writes and disk operations
       • Mitigate inevitable failures
       (Diagram: read and write paths between the application server and Couchbase Server for document A.)
    4. How many nodes? 5 key factors determine the number of nodes needed:
       1) RAM
       2) Disk
       3) CPU
       4) Network
       5) Data distribution/safety
       (Diagram: application users, web application servers, Couchbase Servers.)
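The "whichever factor needs the most nodes wins" logic behind this slide can be sketched in a few lines. This is an illustrative calculation under assumed per-node capacities, not an official Couchbase sizing tool; the function name and parameters are made up for the example:

```python
import math

# Illustrative sketch: the node count is driven by whichever resource
# requires the most nodes, with a floor of three nodes for data safety
# as recommended later in the deck. All names/values are assumptions.

def nodes_needed(total_ram_gb, ram_per_node_gb,
                 total_disk_gb, disk_per_node_gb,
                 write_ops, writes_per_node_ops,
                 min_for_safety=3):
    per_factor = [
        math.ceil(total_ram_gb / ram_per_node_gb),       # RAM
        math.ceil(total_disk_gb / disk_per_node_gb),     # disk space
        math.ceil(write_ops / writes_per_node_ops),      # disk I/O
    ]
    return max(min_for_safety, *per_factor)

# RAM is the bottleneck here: 400GB needed / 64GB per node -> 7 nodes
print(nodes_needed(400, 64, 1000, 500, 20000, 10000))  # prints 7
```

The same shape extends naturally to CPU and network once you have per-node figures for those as well.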
    5. RAM sizing. Keep the working set in RAM for best read performance. Total RAM includes:
       • Managed document cache: working set, metadata, active + replicas
       • Index caching (I/O buffer)
       (Diagram: reading data; the application server requests document A and receives it from the server's RAM.)
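A rough back-of-the-envelope RAM estimate can be derived from these components. The per-key metadata figure and the formula itself are illustrative assumptions for this sketch, not Couchbase's official sizing formula:

```python
# Hypothetical RAM sizing sketch combining the factors on this slide:
# metadata for every document stays in RAM (it cannot be ejected), while
# only the working set of values needs to be cached, for active + replica
# copies. The 56-byte metadata constant is an assumption for illustration.

def estimate_ram_gb(num_docs, avg_key_bytes, avg_value_bytes,
                    working_set_ratio, num_replicas, metadata_bytes=56):
    copies = 1 + num_replicas                          # active + replicas
    # Metadata (plus keys) for all documents must fit in RAM.
    metadata = copies * num_docs * (metadata_bytes + avg_key_bytes)
    # Only the working set of document values needs to stay cached.
    values = copies * num_docs * working_set_ratio * avg_value_bytes
    return (metadata + values) / (1024 ** 3)

# Example: 100M docs, 40B keys, 2KB values, 20% working set, 1 replica
print(round(estimate_ram_gb(100_000_000, 40, 2048, 0.20, 1), 1))
```

Note how the metadata term grows with the total document count regardless of the working set ratio, which is exactly why RAM can "fill up with metadata" as the next slide warns.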
    6. The working set ratio depends on your application:
       • working/total set = .01: late-stage social game; many users no longer active, few logged in at any given time
       • working/total set = .33: business application; users logged in during the day, and the day moves around the globe
       • working/total set = 1: ad network; any cookie can show up at any time
    7. RAM sizing: working set managed cache. As memory use grows, some cached data will be removed from RAM to make space:
       • Active and replica data share RAM
       • Threshold based (NRU, favoring active data)
       • Only cleanly persisted data can be "ejected"
       • Only data values can be "ejected", which means RAM can fill up with metadata
    8. RAM sizing: view/index cache (disk I/O).
       • File system cache availability for the index has a big impact on performance
       • Test runs based on 10 million items with a 16GB bucket quota and 4GB or 8GB of system RAM available for indexes
       • Results show that doubling system cache availability halves query latency and increases throughput by 50%
       • Leave RAM free with quotas
    9. Disk sizing: space and I/O.
       • Sustained write rate
       • Rebalance capacity
       • Backups
       • XDCR I/O
       • Compaction
       • Space for the total data set: active + replicas + indexes
       • Append-only format
       (Diagram: writing data; the application server stores document A and Couchbase Server acknowledges it.)
    10. Disk sizing: I/O. What impacts the disk I/O needs?
       • Peak write load
       • Sustained write load
       • Compaction
       • XDCR
       • Views/indexing
       Configurable paths/partitions for data and indexes allow for separation of space and I/O.
    11. Disk sizing: space. What impacts the disk space needs?
       • Total data set
       • Indexes
       • Overhead for compaction (~3x): both data and index files are "append-only"
       Configurable paths/partitions for data and indexes allow for separation of space and I/O.
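The space factors above combine into a simple estimate. The ~3x multiplier is the rule of thumb from this slide for append-only growth between compactions; the function itself is an illustrative sketch, not an official calculator:

```python
# Illustrative disk-space sketch: total data set (active + replicas)
# plus indexes, multiplied by headroom for append-only file growth
# before compaction reclaims space (~3x per the slide).

def disk_space_gb(active_data_gb, num_replicas, index_gb,
                  compaction_overhead=3.0):
    dataset = active_data_gb * (1 + num_replicas)
    return (dataset + index_gb) * compaction_overhead

# 200GB active data, 1 replica, 50GB of indexes:
print(disk_space_gb(200, 1, 50))  # prints 1350.0
```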
    12. Disk sizing: impact of views on I/O and space.
       • Number of design documents (DDs): extra space for each DD, extra I/O to process each DD; segregate views by DD
       • Complexity of views (I/O)
       • Amount of view output (space): emit as little as possible; the doc ID is automatically included
       • Use development views and extrapolate
    13. Disk sizing: append-only.
       • The append-only file format puts all new/updated/deleted items at the end of the on-disk file: better performance and reliability, and no more fragmentation!
       • This can leave invalidated data in the "back" of the file
       • Need to compact data
    14. Disk compaction.
       Initial file layout: Doc A, Doc B, Doc C
       Update some data: Doc A, Doc B, Doc C, Doc A', Doc B', Doc D, Doc A''
       After compaction: Doc C, Doc B', Doc D, Doc A''
    15. Disk compaction. Compaction happens automatically:
       • Settings for the "threshold" of stale data
       • Settings for time of day
       • Split by data and index files
       • Per-bucket or global
       It reduces the size of on-disk files, both data files and index files. It temporarily increases disk I/O and CPU, but causes no downtime!
    16. Tuning compaction.
       • A space versus time/I/O tradeoff
       • 30% is the default threshold; 60% was found better for heavy writes. Why?
       • Parallel compaction only if high CPU and disk I/O are available
       • Limit to off-hours if necessary
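The threshold setting above can be pictured as a fragmentation check: compaction fires once the stale fraction of the append-only file exceeds the configured percentage. The names below are illustrative, not Couchbase internals:

```python
# Sketch of the fragmentation logic behind the compaction threshold:
# in an append-only file, stale bytes = file size minus live data.
# Function and parameter names are assumptions for illustration.

def fragmentation_pct(file_size_bytes, live_data_bytes):
    stale = file_size_bytes - live_data_bytes
    return 100.0 * stale / file_size_bytes

def should_compact(file_size_bytes, live_data_bytes, threshold_pct=30):
    return fragmentation_pct(file_size_bytes, live_data_bytes) >= threshold_pct

# A 10GB file holding 6GB of live data is 40% stale:
print(should_compact(10 * 1024**3, 6 * 1024**3))                    # True at the 30% default
print(should_compact(10 * 1024**3, 6 * 1024**3, threshold_pct=60))  # False at a heavy-write 60%
```

This also shows why a higher threshold suits write-heavy workloads: the file is allowed to carry more stale data (space cost) in exchange for compacting less often (I/O cost).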
    17. CPU sizing.
       • Disk writing
       • Views/compaction/XDCR
       • RAM read/write performance is not impacted
       1.8 used very little CPU; under the same workloads, 2.0 should not be much different. New 2.0 features will require more CPU:
       • Minimum requirement of 4 cores
       • 1 core per design document
    18. Network sizing.
       • Client traffic (reads + writes)
       • Replication (multiplies writes)
       • Rebalancing
       • XDCR
    19. 10GbE vs. 1GbE: consistently low latencies in microseconds for varying document sizes with a mixed workload. http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-708169.pdf
    20. Data distribution. Servers fail; be prepared. The more nodes, the less impact a failure will have. Data distribution/safety, assuming one replica:
       • 1 node = BAD
       • 2 nodes = ...better...
       • 3+ nodes = BEST!
       Note: many applications will need more than 3 nodes.
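The arithmetic behind "more nodes = less impact" is simple to sketch: when one of n nodes fails over, its share of the load spreads across the n-1 survivors:

```python
# Sketch of the failure-impact math on this slide: each surviving node
# absorbs 1/(n-1) of the failed node's load share.

def load_increase_pct(num_nodes):
    if num_nodes < 2:
        raise ValueError("need at least 2 nodes to survive a failure")
    return 100.0 / (num_nodes - 1)

for n in (2, 3, 5, 10):
    print(n, "nodes:", round(load_increase_pct(n), 1), "% extra load per survivor")
# 2 nodes: 100.0% (back to a single point of failure)
# 3 nodes: 50.0%, 5 nodes: 25.0%, 10 nodes: ~11.1%
```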
    21. "Rebalance overhead":
       • Disk: read from "source" nodes, write to "destination" nodes
       • CPU: disk writing, compaction, index updates
       • RAM: working set maintained, queuing of writes
       • Network: data transfer
    22. How many nodes? (recap) New 2.0 features affect sizing requirements:
       • Views/indexing/querying
       • XDCR
       • Append-only file format
       The same 5 key factors still determine the number of nodes needed:
       1) RAM
       2) Disk
       3) CPU
       4) Network
       5) Data distribution
       (Diagram: application users, web application servers, Couchbase Servers.)
    23. Monitoring
    24. Key resources: RAM, disk, network, CPU. (Diagram: application servers connected over the network to servers, each with RAM and disk.)
    25. Monitoring: a few "realms".
       • Sizing/growth tracking
       • Performance
       • Impending doom
       • External to Couchbase
    26. Sizing/growth.
       • RAM: working set (cache miss rate and disk fetches), metadata usage
       • Disk: space; I/O (disk write queue size)
       • CPU: indexing consistency
       • Network: internal replication lag (TAP queues), XDCR lag
    27. Performance.
       • Most critical is the cache miss ratio
       • Use "timing" stats to spot-check
       • OOM errors
       • Disk write failures
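The cache miss ratio called out above can be derived from bucket stats. The stat names `ep_bg_fetched` (background disk fetches) and `cmd_get` (total get operations) are intended to match Couchbase's cbstats output, but treat them as assumptions here:

```python
# Sketch: cache miss ratio = disk fetches as a percentage of all gets.
# Stat names are taken from cbstats output and should be verified
# against your Couchbase version.

def cache_miss_ratio(stats):
    gets = stats.get("cmd_get", 0)
    if gets == 0:
        return 0.0
    return 100.0 * stats.get("ep_bg_fetched", 0) / gets

sample = {"cmd_get": 50_000, "ep_bg_fetched": 1_250}
print(cache_miss_ratio(sample))  # prints 2.5 - i.e. 2.5% of reads hit disk
```

A rising miss ratio is an early signal that the working set no longer fits in RAM, tying this slide back to the RAM sizing discussion earlier.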
    28. Impending doom.
       • Memory over the high watermark
       • Disk space full
       • Disk write queue never draining
       • 100% CPU
    29. External to Couchbase.
       • Nagios/Cacti/Zenoss/etc.
       • "System" monitoring: RAM/disk/CPU/network
       • Gathering from Couchbase:
         - REST API: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-admin-restapi.html
         - "cbstats": http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-admin-cmdline-cbstats.html
       • Use the statistics in the "Summary" section of the UI as a starting place
    30. Monitoring a rebalance.
       • vBucket count on each node
         - When adding/removing nodes, vBucket counts will increase/decrease as vBuckets are moved
         - During a swap rebalance, new and old nodes will swap vBuckets
         - The active vBucket count should always total 1024 across the cluster
       • Disk write queue on "destination" nodes
         - The destination node in each vBucket movement will have a higher disk write queue
         - At ~1M items, the destination node will throttle the rebalance
         - This is the most common cause of a slow or stuck rebalance
       • Backfill queue on "source" nodes
         - This shows the data being moved between vBuckets
         - May show bursts and one-off reads for data
       • RAM usage (always)
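The "active vBucket count should always total 1024" check above is easy to automate. The input shape here (a node-to-count mapping, e.g. gathered via cbstats) is an assumption for the sketch:

```python
# Sketch of the rebalance sanity check from this slide: active vBucket
# counts across all nodes must sum to 1024 at all times. The input
# mapping shape is an assumption for illustration.

EXPECTED_ACTIVE_VBUCKETS = 1024

def vbucket_counts_ok(per_node_active):
    return sum(per_node_active.values()) == EXPECTED_ACTIVE_VBUCKETS

cluster = {"node1": 342, "node2": 341, "node3": 341}
print(vbucket_counts_ok(cluster))  # prints True: 342 + 341 + 341 == 1024
```

During a rebalance the per-node counts shift as vBuckets move, but the cluster-wide total should never deviate; a shortfall suggests stuck or orphaned vBuckets worth investigating.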
    31. Want more? Lots of details and best practices in our documentation: http://www.couchbase.com/docs/
    32. Thank you. Couchbase: NoSQL document database. perry@couchbase.com, @couchbase
