Scaling up with Aerospike!

LSPE presentation on Jun 14, 2014.

Published in: Software

  1. Scaling up (and easing) operations at 1 Million TPS @ <1 ms latency. LSPE, Jun 14, 2014
  2. Agenda of this talk
     ● Some types of Big Data
     ● What are the problems that come with scale?
     ● What is the solution? (Or: how Aerospike tackles these problems and why it is the solution to them.)
  3. ● Anshu Prateek
     ● Aerospike Devops Lead
     ● Ex - Yahoo! Search Operations
     ● http://about.me/anshuprateek
     ● anshu@aerospike.com
  4. Big Data Type
     ● Volume – Hadoop – PB / Hrs of jobs
     ● Variety – ETL – Many data sources, mashup, analyze
     ● Velocity – Do it fast, do it now!
     → Volume and Variety need Velocity to be useful.
  5. What starts failing at scale?
     ● Machines / hardware
     ● Network
     ● Unplanned load
     ● Operator error
  6. Big Data..
     ● Volume – Hadoop – PB / Hrs of jobs
     ● Variety – ETL – Many data sources, mashup, analyze
     ● Velocity – Do it fast, do it now!
     → Volume and Variety need Velocity to be useful.
  7. Velocity in Aerospike
     ● Latency – Page SLA 700 ms, Ads SLA 50 ms → Data store <5 ms – Hybrid DRAM + SSD optimized storage
     ● Throughput – Horizontal scalability (Linear is desirable)
  8. Prod example:
     ● 20 Nodes
     ● 1.6TB per node
     ● 50GB DRAM usage
     ● 14 Billion objects
     ● 70k TPS (r+w) per node peak
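The figures on this slide support some back-of-envelope arithmetic. The sketch below multiplies out the cluster-wide peak throughput and raw SSD capacity, and estimates index memory per node. The 64 bytes of in-memory index per object is an assumption (it is not on the slide), as is treating the 14 billion objects as spread evenly across the nodes.

```python
# Figures taken from the production example slide.
NODES = 20
SSD_TB_PER_NODE = 1.6
OBJECTS = 14_000_000_000
PEAK_TPS_PER_NODE = 70_000

# ASSUMPTION, not from the slide: in-memory index bytes per object.
ASSUMED_INDEX_BYTES_PER_OBJECT = 64


def cluster_peak_tps(nodes, tps_per_node):
    """Cluster-wide peak throughput if every node peaks at once."""
    return nodes * tps_per_node


def cluster_ssd_tb(nodes, tb_per_node):
    """Raw SSD capacity across the cluster, in TB."""
    return nodes * tb_per_node


def index_gb_per_node(objects, nodes, index_bytes_per_object):
    """Estimated index memory per node in GB, assuming an even spread."""
    return objects * index_bytes_per_object / nodes / 1e9


if __name__ == "__main__":
    print("peak TPS:", cluster_peak_tps(NODES, PEAK_TPS_PER_NODE))
    print("SSD TB:", cluster_ssd_tb(NODES, SSD_TB_PER_NODE))
    print("index GB/node:",
          index_gb_per_node(OBJECTS, NODES, ASSUMED_INDEX_BYTES_PER_OBJECT))
```

Under these assumptions the cluster peaks at 1.4 million TPS, and the index estimate of roughly 44.8 GB per node is in the same range as the 50GB DRAM figure the slide reports.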
  9. ● 98% of queries < 1 ms
  10. Yet another prod graph...
  11. What starts failing at scale?
      ● Machines / hardware
      ● Network
      ● Unplanned load
      ● Operator error
  12. Start scaling with Aerospike..
      ● Machines / hardware – Replication / auto-balancing
      ● Network – Availability of islands – Auto balancing with eventual consistency
      ● Unplanned load – Have a lot of headroom
      ● Operator error – What if the system reduces operational needs? – Tools
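The link between losing machines and keeping headroom can be made concrete with a little arithmetic. This is a minimal sketch, assuming load rebalances evenly across the surviving nodes. The 20-node, 70k TPS per node figures come from the production example earlier in the deck; the per-node capacity of 100,000 TPS is a hypothetical number chosen only for illustration.

```python
def per_node_tps_after_failures(total_tps, nodes, failed):
    """Per-node load once `failed` nodes drop out and the rest share the work."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    return total_tps / survivors


def headroom(load_tps, capacity_tps):
    """Fraction of per-node capacity still unused at the given load."""
    return 1.0 - load_tps / capacity_tps


if __name__ == "__main__":
    total = 20 * 70_000            # cluster peak from the prod example
    assumed_capacity = 100_000     # HYPOTHETICAL per-node limit, not from the deck
    for failed in range(0, 4):
        load = per_node_tps_after_failures(total, 20, failed)
        print(failed, round(load), round(headroom(load, assumed_capacity), 3))
```

With one node down, each of the 19 survivors carries about 73,684 TPS instead of 70,000, so whatever headroom exists shrinks with every failure.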
  13. Operational Ease
      ● Reducing initial setup time – Auto sharding – Auto cluster discovery
      ● Configuration – People don't read documents
      ● RTFM! – Good default values – Retain the power to control when needed
      ● Static configs
      ● Dynamic configs
  14. Tools
      ● Do all nodes have the same config? – asmonitor -e 'compareconfig'
      ● What's the cluster status? – asmonitor -e 'info'
      ● Oops, this needs to be changed! – asinfo -v 'set-config:context=service;letschangethis=value'
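When such changes are scripted, the `set-config` string can be assembled programmatically. A minimal sketch: the helper name `build_set_config_args` is hypothetical, and `letschangethis` is the slide's placeholder rather than a real parameter. Only the `asinfo -v 'set-config:context=...;name=value'` shape comes from the slide.

```python
import shlex


def build_set_config_args(context, name, value):
    """Build the argument list for a dynamic config change via asinfo.

    Hypothetical helper; it only formats the command shown on the slide
    and does not contact a cluster.
    """
    for part in (context, name, value):
        # ';' separates fields in the command string, so reject it in any field.
        if not part or ";" in part:
            raise ValueError(f"invalid field: {part!r}")
    return ["asinfo", "-v", f"set-config:context={context};{name}={value}"]


if __name__ == "__main__":
    args = build_set_config_args("service", "letschangethis", "value")
    # Print a shell-safe rendering; pass `args` to subprocess.run to execute it.
    print(shlex.join(args))
```

Building the command as an argument list rather than one string avoids shell-quoting problems around the `;` inside the `set-config` value.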
  15. Tools
      ● Nagios
      ● Graphite
      ● AMC
  16. Capacity Planning
  17. Managing with AMC
  18. Managing with AMC
  19. Managing with AMC
  20. Headroom!
      ● How many TPS can we do?
  21. ● 330 GCE
      ● 300 x 1TB
      ● Debian, Cassandra 2.2
      ● Median Latency – 10.3 ms
      ● 95% < 23 ms
  22. Aerospike