
Cassandra at Teads


Teads is #1 in video ads. Read how Teads handles up to ~1 million requests/s with Apache Cassandra: how we tuned Cassandra servers and clients, what issues we faced during the last year, how we provision our clusters, which tools we use (Datadog for monitoring and alerting, Cassandra Reaper, Rundeck, Sumo Logic, cassandra_snapshotter), and why we need a fork.



  1. 1. Cassandra @ Lyon Cassandra Users. Romain Hardouin, Cassandra architect @ Teads. 2017-02-16
  2. 2. Cassandra @ Teads, agenda: I. About Teads, II. Architecture, III. Provisioning, IV. Monitoring & alerting, V. Tuning, VI. Tools, VII. C’est la vie, VIII. A light fork
  3. 3. I. About Teads
  4. 4. Teads is the inventor of native video advertising with inRead, an award-winning format* (*IPA Media owner survey 2014, IAB-recognized format)
  5. 5. 27 offices in 21 countries. 500+ global employees. 1.2B users global reach. 90+ R&D employees.
  6. 6. Teads growth (tracking events chart)
  7. 7. Advertisers (to name a few)
  8. 8. Publishers (to name a few)
  9. 9. II. Architecture
  10. 10. Apache Cassandra version: custom C* 2.1.16, with jvm.options from C* 3.0, logback from C* 2.2, backports and patches
  11. 11. Usage: up to 940K qps, writes vs reads
  12. 12. Topology: 2 regions (EU & US), 3rd region APAC coming soon. 4 clusters, 7 DCs, 110 nodes (up to 150 with temporary DCs). HP server blades: 1 cluster, 18 nodes.
  13. 13. AWS nodes
  14. 14. AWS instance types: i2.2xlarge (8 vCPU, 61 GB, 2 x 800 GB attached SSD in RAID0), c3.4xlarge (16 vCPU, 30 GB, 2 x 160 GB attached SSD in RAID0), c4.4xlarge (16 vCPU, 30 GB, EBS 3.4 TB + 1 TB). Workloads: tons of counters; Big Data with wide rows, many billions of keys, LCS with TTL.
  15. 15. More on EBS nodes: 20 x c4.4xlarge with GP2 SSD volumes. 3.4 TB data volume ⇒ 10,000 IOPS @ 16 KB, 1 TB commitlog volume ⇒ 3,000 IOPS @ 16 KB (GP2 provisions ~3 IOPS per GB). 25 tables: batch + real time. Temporary DC. Pros: cheap storage, great for STCS, snapshots (S3 backup), no coupling between disks and CPU/RAM. Cons: high latency => high I/O wait, throughput limited to 160 MB/s, unsteady performance.
  16. 16. Physical nodes
  17. 17. Hardware nodes: HP Apollo XL170R Gen9, 12 CPU Xeon @ 2.60 GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0. For Big Data, supersedes the EBS DC.
  18. 18. DC/Cluster split
  19. 19. Instance type change: DC X (20 x i2.2xlarge, counters) rebuilt as DC Y (20 x c3.4xlarge, counters), cheaper and with more CPUs.
  20. 20. Workload isolation, step 1: DC split. DC A (20 x i2.2xlarge, counters + Big Data) is split by rebuilding DC B (20 x c3.4xlarge, counters) and DC C (20 x c4.4xlarge EBS, Big Data).
  21. 21. Workload isolation, step 2: cluster split. The Big Data DC (20 x c4.4xlarge, EBS) moves to a dedicated Big Data cluster, linked over AWS Direct Connect.
  22. 22. Data model
  23. 23. “KISS” principle No fancy stuff  No secondary index  No list/set/tuple  No UDT
  24. 24. III. Provisioning
  25. 25. Now: Capistrano → Chef. Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper. Chef provisioning to spawn a cluster.
  26. 26. Future: C* cookbook (michaelklishin/cassandra-chef-cookbook + Teads custom wrapper), Terraform + Chef provisioner.
  27. 27. IV. Monitoring & alerting
  28. 28. Past: OpsCenter (free)
  29. 29. Turnkey dashboards; support is reactive; main metrics only; per-host graphs impossible with many hosts.
  30. 30. DataStax OpsCenter free version (v5): ring view, more than monitoring, lots of metrics. Still lacks some metrics; dashboard creation has no templates; the agent is heavy. Free version limitations: data stored in the production cluster, Apache C* <= 2.1 only.
  31. 31. Datadog: all the metrics you want. Dashboard creation: templating, TimeBoard vs ScreenBoard. Graph creation: aggregation, trend, rate, anomaly detection. No turnkey dashboards yet (may change: TLP templates). Additional fees if >350 metrics; we need to increase this limit for our use case.
  32. 32. Now we can easily  Find outliers  Compare a node to average  Compare two DCs  Explore a node’s metrics  Create overview dashboards  Create advanced dashboards for troubleshooting
  33. 33. Datadog’s cassandra.yaml:
      - include:
          bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
          attribute:
            - Count
      - include:
          bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
          attribute:
            - Count
            - 99thPercentile
      - include:
          bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
      - include:
          bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
          attribute:
            Value:
              alias: cassandra.ntr.MaxTasksQueued
  34. 34. ScreenBoard
  35. 35. TimeBoard (per-host graphs)
  36. 36. Example: hints monitoring during maintenance on physical nodes (storage and streaming panels)
  37. 37. Datadog alerting: down node, exceptions, commitlog size, high latency, high GC, high I/O wait, high pendings, many hints, long thrift connections, clock out of sync (don’t miss this one), disk space (don’t forget /), …
  38. 38. V. Tuning
  39. 39. Java 8, CMS → G1. cassandra-env.sh: -Dcassandra.max_queued_native_transport_requests=4096 -Dcassandra.fd_initial_value_ms=4000 -Dcassandra.fd_max_interval_ms=4000
  40. 40. jvm.options (backport from C* 3.0), GC logs enabled: -XX:MaxGCPauseMillis=200 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:G1HeapRegionSize=32m -XX:G1HeapWastePercent=25 -XX:InitiatingHeapOccupancyPercent=? -XX:ParallelGCThreads=#CPU -XX:ConcGCThreads=#CPU -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseCompressedOops -XX:HeapDumpPath=<dir with enough free space> -XX:ErrorFile=<custom dir> -Djava.io.tmpdir=<custom dir> -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch ...
  41. 41. AWS nodes (heap: c3.4xlarge 15 GB, i2.2xlarge 24 GB): num_tokens: 256; native_transport_max_threads: 256 or 128; compaction_throughput_mb_per_sec: 64; concurrent_compactors: 4 or 2; concurrent_reads: 64; concurrent_writes: 128 or 64; concurrent_counter_writes: 128; hinted_handoff_throttle_in_kb: 10240; max_hints_delivery_threads: 6 or 4; memtable_cleanup_threshold: 0.6, 0.5 or 0.4; memtable_flush_writers: 4 or 2; trickle_fsync: true; trickle_fsync_interval_in_kb: 10240; dynamic_snitch_badness_threshold: 2.0; internode_compression: dc
  42. 42. AWS EBS nodes (heap: c4.4xlarge 15 GB). An EBS volume != a disk: compaction_throughput_mb_per_sec: 32; concurrent_compactors: 4; concurrent_reads: 32; concurrent_writes: 64; concurrent_counter_writes: 64; trickle_fsync_interval_in_kb: 1024
  43. 43. Hardware nodes (heap: 24 GB): num_tokens: 8; initial_token: ... (more on this later); native_transport_max_threads: 512; compaction_throughput_mb_per_sec: 128; concurrent_compactors: 4; concurrent_reads: 64; concurrent_writes: 128; concurrent_counter_writes: 128; hinted_handoff_throttle_in_kb: 10240; max_hints_delivery_threads: 6; memtable_cleanup_threshold: 0.6; memtable_flush_writers: 8; trickle_fsync: true; trickle_fsync_interval_in_kb: 10240
  44. 44. Hardware nodes. Why 8 tokens? Better repair performance, important for Big Data. Evenly distributed tokens, stored in a Chef data bag. ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4 => { "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901", "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202", "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503" } https://github.com/rhardouin/cassandra-scripts Watch out! Know the drawbacks.
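     A minimal Scala sketch of the even-spacing idea behind the script above, assuming the illustrative 3 nodes x 4 tokens from the example output (the real script additionally interleaves racks):

     // Murmur3 tokens live in [-2^63, 2^63). Place numNodes * tokensPerNode tokens at
     // equal intervals and deal them out round-robin so every node spans the whole ring.
     val numNodes = 3                        // illustrative, matches the example above
     val tokensPerNode = 4                   // illustrative (production hardware nodes use 8)
     val totalTokens = numNodes * tokensPerNode
     val minToken = -(BigInt(2).pow(63))
     val step = BigInt(2).pow(64) / totalTokens

     val tokensByNode: Map[Int, Seq[BigInt]] =
       (0 until totalTokens)
         .map(i => (i % numNodes, minToken + step * i))   // round-robin assignment
         .groupBy(_._1)
         .map { case (node, pairs) => node -> pairs.map(_._2) }

     tokensByNode.toSeq.sortBy(_._1).foreach { case (node, tokens) =>
       println(s"node $node: ${tokens.mkString(",")}")    // comma-separated initial_token list
     }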
  45. 45. Compression. Small entries, lots of reads: compression = { 'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' } + nodetool scrub (a few GB).
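     For illustration, applying those compression settings through the DataStax Java driver from Scala; a minimal sketch, assuming a hypothetical keyspace/table and contact point (C* 2.1 option names as shown above):

     import com.datastax.driver.core.Cluster

     val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()  // illustrative contact point
     val session = cluster.connect()
     // 4 KB chunks + LZ4 for a read-heavy table of small entries (hypothetical name)
     session.execute(
       "ALTER TABLE ks.small_entries WITH compression = " +
         "{'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor', 'chunk_length_kb': 4}")
     // Existing sstables keep the old chunk size until rewritten (the slide uses nodetool scrub)
     session.close()
     cluster.close()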
  46. 46. Dynamic snitch: disabled on 2 small clusters (dynamic_snitch: false), lower hop count.
  47. 47. Dynamic snitch: client-side latency (mean, P75, P95).
  48. 48. Downscale: which node to decommission?
  49. 49. Clients
  50. 50. Scala apps with a DataStax driver wrapper; Spark & Spark Streaming with the DataStax Spark Cassandra Connector.
  51. 51. DataStax driver policy: LatencyAwarePolicy → TokenAwarePolicy. LatencyAwarePolicy: hotspots due to premature node eviction, needs thorough tuning and a steady workload, so we dropped it. TokenAwarePolicy: shuffle replicas depending on CL (see the sketch below).
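     A minimal sketch of token awareness with shuffled replicas using the DataStax Java driver 3.x from Scala (contact point and DC name are illustrative; the CL-dependent shuffling of Teads' wrapper is not shown):

     import com.datastax.driver.core.Cluster
     import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

     val cluster = Cluster.builder()
       .addContactPoint("10.0.0.10")                 // illustrative contact point
       .withLoadBalancingPolicy(
         new TokenAwarePolicy(
           DCAwareRoundRobinPolicy.builder()
             .withLocalDc("eu-west")                 // illustrative local DC
             .build(),
           true))                                    // shuffle replicas instead of always hitting the first one
       .build()
     val session = cluster.connect()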
  52. 52. For cross-region scheduled jobs: VPN between AWS regions, 20 executors with 6 GB RAM. output.consistency.level = (LOCAL_)ONE; output.concurrent.writes = 50; connection.compression = LZ4
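     These look like Spark Cassandra Connector settings; a hedged Scala sketch of wiring them into such a job (app name, host, keyspace, table and columns are hypothetical):

     import org.apache.spark.{SparkConf, SparkContext}
     import com.datastax.spark.connector._

     val conf = new SparkConf()
       .setAppName("cross-region-scheduled-job")                                   // hypothetical
       .set("spark.cassandra.connection.host", "cassandra.remote-dc.example.com")  // hypothetical, reached over the VPN
       .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
       .set("spark.cassandra.output.concurrent.writes", "50")
       .set("spark.cassandra.connection.compression", "LZ4")

     val sc = new SparkContext(conf)
     // Hypothetical keyspace/table: writes leave at LOCAL_ONE with 50 concurrent write batches per task
     sc.parallelize(Seq(("2017-02-16", 42L)))
       .saveToCassandra("ks", "daily_counts", SomeColumns("day", "value"))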
  53. 53. Useless writes: 99% of empty unlogged batches on one DC. What an optimization!
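     A hedged sketch of the kind of client-side guard that avoids such writes: only send an unlogged batch when it actually contains statements (helper name is hypothetical, DataStax Java driver 3.x API):

     import com.datastax.driver.core.{BatchStatement, Session}

     object BatchGuard {
       // An empty batch still costs a round trip and coordinator work, so skip it entirely
       def executeIfNonEmpty(session: Session, batch: BatchStatement): Unit =
         if (batch.size() > 0) {
           session.execute(batch)
         }
     }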
  54. 54. VI. Tools
  55. 55. Rundeck, the “Job Scheduler & Runbook Automation”: {parallel SSH + cron} on steroids. Security; history of who/what/when/why (we added a “comment” field); output is kept. Jobs: CQL migration, rolling restart, nodetool or JMX commands, backup and snapshot jobs.
  56. 56. Cassandra Reaper: scheduled range repair. Segments: up to 20,000 for TB tables. Hosted fork for C* 2.1; we will probably switch to TLP’s fork. We do not use incremental repairs (see the fix in C* 4.0).
  57. 57. cassandra_snapshotter: backup on S3, scheduled with Rundeck. We created and use a fork; some PRs were merged upstream, the restore PR is still to be merged.
  58. 58. Log management with Sumo Logic. Queries: "C* " and "out of sync"; "C* " and "new session: will sync" | count; ... Alerts on patterns: "C* " and "[ERROR]"; "C* " and "[WARN]" and not ( … ); ...
  59. 59. VII. C’est la vie
  60. 60. Cassandra issues & failures
  61. 61. OS reboot without any obvious reason… seems harmless, right? The Cassandra service was enabled, so it restarted. Want a clue? C* 2.0 + counters. The upgrade to C* 2.1 was a relief.
  62. 62. Upgrade 2.0 → 2.1: the LCS cluster suffered: high load, pending compactions kept growing. Switched to off-heap memtables: less GC => less load. Reduced client load. Better after the sstables upgrade, which took days.
  63. 63. Upgrade 2.0 → 2.1: lots of NTR “All time blocked”. The NTR queue is undersized for our workload: 128 (hard-coded). We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096. The NTR pool needs to be sized accordingly.
  64. 64. After replacing nodes: DELETE FROM system.peers WHERE peer = '<replaced node>'; system.peers is used by the DataStax driver for auto discovery.
  65. 65. AWS issues & failures
  66. 66. When you have to shoot, shoot, don’t talk! The Good, the Bad and the Ugly
  67. 67. “We’ll shoot your instance” / “We shot your instance”; EBS volume latency spike; EBS volume unreachable.
  68. 68. SSD
  69. 69. SSD: dstat -tnvlr
  70. 70. Hardware issues & failures
  71. 71. One SSD failed. CPUs suddenly became slow on one server: Smart Array battery, BIOS bug. Yup, not an SSD...
  72. 72. VIII. A light fork
  73. 73. Why a fork? 1. Needed to add a patch ASAP: high blocked NTR (CASSANDRA-11363), which required deploying from source. 2. Why not backport interesting tickets? 3. Why not add small features/fixes? E.g. expose the tasks queue length via JMX (CASSANDRA-12758), see the JMX sketch below. You betcha!
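     For illustration, reading that queue-length metric over JMX with plain javax.management from Scala (host name is illustrative, 7199 is Cassandra's default JMX port; the MBean name matches the Datadog config shown earlier):

     import javax.management.ObjectName
     import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

     val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi")
     val connector = JMXConnectorFactory.connect(url)
     try {
       val mbeans = connector.getMBeanServerConnection
       val ntrQueue = new ObjectName(
         "org.apache.cassandra.metrics:type=ThreadPools,path=transport," +
           "scope=Native-Transport-Requests,name=MaxTasksQueued")
       println(mbeans.getAttribute(ntrQueue, "Value"))   // max observed NTR queue length
     } finally {
       connector.close()
     }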
  74. 74. A tiny fork: we keep it as small as possible to fit our needs. It will get even smaller when we upgrade to C* 3.0, since the backports will become obsolete.
  75. 75. « Hey, if your fork is tiny, is it that useful? »
  76. 76. One example: repair
  77. 77. Backport of CASSANDRA-12580 Paulo Motta: log2 vs ln
  78. 78. Get this information without the DEBUG burden. Original fix vs one-liner fix.
  79. 79. The most impressive result for a set of tables: before, 23 days; with CASSANDRA-12580, 16 hours. Longest repair for a single table: 2.5 days; it was impossible to repair this table before the patch, now it fits within gc_grace_seconds.
  80. 80. It was a critical fix for us. It should have landed in 2.1.17 IMHO*: repair is a mandatory operation in many use cases, Paulo already made the patch for 2.1, and C* 2.1 is widely used. [*] Full post: http://www.mail-archive.com/user@cassandra.apache.org/msg49344.html
  81. 81. « Why bother with backports? Why not upgrade to 3.0? »
  82. 82. Because C* is critical for our business. We don’t need fancy stuff (SASI, MV, UDF, ...). We just want a rock-solid, scalable DB. C* 2.1 does the job for the time being. We plan to upgrade to C* 3.0 in 2017. We will do thorough tests ;-)
  83. 83. « What about C* 2.2? »
  84. 84. C* 2.2 has some nice improvements: bootstrapping with LCS sends the source sstable level [1], a fix for range movements causing CPU & performance impact [2], resumable bootstrap/rebuild streaming [3]. [1] CASSANDRA-7460 [2] CASSANDRA-9258 [3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810. But the migration path 2.2 → 3.0 is risky: just my opinion based on the users mailing list; DSE never used 2.2.
  85. 85. Questions?
  86. 86. Thanks! C* devs for their awesome work Zenika for hosting this meetup You for being here!
