Building AuroraObjects - Ceph Day Frankfurt


Wido den Hollander, 42on.com

Who am I?
● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009
– Wrote the Apache CloudStack integration
– libvirt RBD storage pool support
– PHP and Java bindings for librados
PCextreme?
● Founded in 2004
● Medium-sized ISP in the Netherlands
● 45,000 customers
● Started as a shared hosting company
● Datacenter in Amsterdam
What is AuroraObjects?
● Under the name “Aurora” my hosting company PCextreme B.V. has two services:
– AuroraCompute, a CloudStack-based public cloud backed by Ceph's RBD
– AuroraObjects, a public object store using Ceph's RADOS Gateway
● AuroraObjects is a public RADOS Gateway service (S3 only) running in production
The RADOS Gateway (RGW)
● Serves objects using either Amazon's S3 or OpenStack's Swift protocol
● All objects are stored in RADOS; the gateway is just an abstraction between HTTP/S3 and RADOS
The RADOS Gateway
(image slide)
Our ideas
● We wanted to cache frequently accessed objects using Varnish
– Only possible with anonymous clients (see the sketch below)
● SSL should be supported
● Storage shared between the Compute and Objects services
● 3x replication
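Why anonymous clients only? An authenticated S3 request carries a per-user signature (in the Authorization header or the query string), so its response cannot safely be served from a shared cache. The real check lives in our Varnish VCL; what follows is only a minimal Python sketch of that rule:

def is_cacheable(method, headers, query=""):
    """Sketch of the cache-candidate rule: anonymous, read-only requests only."""
    if method not in ("GET", "HEAD"):
        return False  # writes must always reach the RADOS Gateway
    if any(k.lower() == "authorization" for k in headers):
        return False  # signed request: the response is per-user
    if "Signature=" in query:
        return False  # S3 query-string authentication must bypass the cache too
    return True

# An anonymous GET is a cache candidate, a signed one is not.
assert is_cacheable("GET", {})
assert not is_cacheable("GET", {"Authorization": "AWS access_key:signature"})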
Varnish
● A caching reverse HTTP proxy
– Very fast: up to 100k requests/s
– Configurable using the Varnish Configuration Language (VCL)
– Used by Facebook and eBay
● Not a part of Ceph, but can be used with the RADOS Gateway
The Gateways
● SuperMicro 1U
– AMD Opteron 6200 series CPU
– 128GB RAM
● 20Gbit LACP trunk
● 4 nodes
● Varnish runs locally with RGW on each node
– Uses the RAM to cache objects
The Ceph cluster
● SuperMicro 2U chassis
– AMD Opteron 4334 CPU
– 32GB RAM
– Intel S3500 80GB SSD for the OS
– Intel S3700 200GB SSD for journaling
– 6x Seagate 3TB 7200RPM drives for the OSDs
● 2Gbit LACP trunk
● 18 nodes
● ~320TB of raw storage (18 nodes × 6 × 3TB)
Our problems
● When we cache objects in Varnish, they don't show up in the usage accounting of the RGW
– The HTTP request never reaches RGW
● When an object changes we have to purge all caches to maintain cache consistency
– A user might change an ACL or modify an object with a PUT request
● We wanted to make cached requests cheaper than non-cached requests
Our solution: Logstash
● All requests go from Varnish into Logstash and into ElasticSearch
– From ElasticSearch we do the usage accounting
● When Logstash sees a PUT, POST or DELETE request, it makes a local request which sends out a multicast to all other RGW nodes to purge that specific object (sketched below)
● We also store bucket storage usage in ElasticSearch so we have an average over the month
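How the purge fan-out could look: each gateway node runs a small listener that turns a multicast purge message into an HTTP PURGE against its local Varnish. This is a sketch, not our production code; the multicast group, port and message format are hypothetical, and the local Varnish needs VCL that accepts PURGE requests.

import socket
import struct
import http.client

MCAST_GROUP, MCAST_PORT = "239.192.0.1", 9999  # hypothetical values

def send_purge(object_path):
    """Announce to all gateway nodes that an object must be purged."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(object_path.encode(), (MCAST_GROUP, MCAST_PORT))

def purge_listener():
    """Runs on every gateway node: PURGE the local Varnish on each message."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        path = sock.recv(4096).decode()
        conn = http.client.HTTPConnection("127.0.0.1", 80)  # local Varnish
        conn.request("PURGE", path)
        conn.getresponse().read()
        conn.close()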
LogStash and ElasticSearch
● varnishncsa → logstash → redis → elasticsearch

input {
  pipe {
    command => "/usr/local/bin/varnishncsa.logstash"
    type => "http"
  }
}

● And we simply execute varnishncsa:

varnishncsa -F '%{VCL_Log:client}x %{VCL_Log:proto}x %{VCL_Log:authorization}x %{Bucket}o %m %{Host}i %U %b %s %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'
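For context, the redis hop between the logstash shipper and the indexer is just a list used as a queue. A minimal Python sketch of that stage, assuming the default list key "logstash" and a local ElasticSearch node (assumptions, not our exact configuration):

import json
import redis     # pip install redis
import requests  # pip install requests

r = redis.Redis(host="127.0.0.1", port=6379)

while True:
    # BLPOP blocks until the shipper LPUSHes the next event onto the list.
    _key, raw = r.blpop("logstash")
    event = json.loads(raw)
    # Index the event into ElasticSearch over its REST API.
    requests.post("http://127.0.0.1:9200/varnish/http", json=event)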
%{Bucket}o?
● With %{<header>}o you can display the output of the response header <header>:
– %{Server}o: Apache 2
– %{Content-Type}o: text/html
● We patched RGW (now in master) so that it can optionally return the bucket name in the response:

200 OK
Connection: close
Date: Tue, 25 Feb 2014 14:42:31 GMT
Server: AuroraObjects
Content-Length: 1412
Content-Type: application/xml
Bucket: "ceph"
X-Cache-Hit: No

● Setting 'rgw expose bucket = true' in ceph.conf makes RGW return the Bucket header
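A quick way to see both headers in action; the endpoint and object below are hypothetical, and X-Cache-Hit is a header our Varnish layer adds:

import requests  # pip install requests

# Hypothetical public object behind Varnish, with
# 'rgw expose bucket = true' set in ceph.conf.
resp = requests.get("http://o.auroraobjects.example/ceph/hello.txt")

print(resp.headers.get("Bucket"))       # e.g. "ceph": used for accounting
print(resp.headers.get("X-Cache-Hit"))  # "Yes" when Varnish served the object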
Usage accounting
● We only query RGW for storage usage and also store that in ElasticSearch
● ElasticSearch is used for all traffic accounting
– Allows us to differentiate between cached and non-cached traffic
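The cached/non-cached split can then be a plain ElasticSearch aggregation. A sketch against the ES REST API, with hypothetical field names ("bucket", "bytes", and "cache" holding the %{Varnish:hitmiss}x value), not our exact mapping:

import requests  # pip install requests

# Sum the bytes served per bucket, split by Varnish hit/miss.
query = {
    "size": 0,
    "aggs": {
        "per_bucket": {
            "terms": {"field": "bucket"},
            "aggs": {
                "per_cache": {
                    "terms": {"field": "cache"},  # "hit" or "miss"
                    "aggs": {"bytes": {"sum": {"field": "bytes"}}},
                }
            },
        }
    },
}

resp = requests.post("http://127.0.0.1:9200/varnish/_search", json=query)
for bucket in resp.json()["aggregations"]["per_bucket"]["buckets"]:
    for cache in bucket["per_cache"]["buckets"]:
        print(bucket["key"], cache["key"], cache["bytes"]["value"])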
Back to Ceph: CRUSHMap
● A good CRUSHMap design should reflect the physical topology of your Ceph cluster
– All machines have a single power supply
– The datacenter has an A and a B power circuit
● We use an STS (Static Transfer Switch) to create a third power circuit
● With CRUSH we store each replica on a different power circuit
– When a circuit fails, 2/3 of the Ceph cluster stays online
– Each power circuit has its own switching / network
The CRUSHMap

type 7 powerfeed

host ceph03 {
    alg straw
    hash 0
    item osd.12 weight 1.000
    item osd.13 weight 1.000
    ..
}

powerfeed powerfeed-a {
    alg straw
    hash 0
    item ceph03 weight 6.000
    item ceph04 weight 6.000
    ..
}

root ams02 {
    alg straw
    hash 0
    item powerfeed-a
    item powerfeed-b
    item powerfeed-c
}

rule powerfeed {
    ruleset 4
    type replicated
    min_size 1
    max_size 3
    step take ams02
    step chooseleaf firstn 0 type powerfeed
    step emit
}
The CRUSHMap
(image slide)
Testing the CRUSHMap
● With crushtool you can test your CRUSHMap

$ crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap
$ crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 --show-statistics

● This shows you the result of the CRUSHMap:

rule 4 (powerfeed), x = 0..1023, numrep = 3..3
CRUSH rule 4 x 0 [36,68,18]
CRUSH rule 4 x 1 [21,52,67]
..
CRUSH rule 4 x 1023 [30,41,68]
rule 4 (powerfeed) num_rep 3 result size == 3: 1024/1024

● Manually verify those locations are correct (or script the check, as sketched below)
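That verification can be scripted. A sketch under stated assumptions: OSD_TO_FEED is a hypothetical OSD-id-to-power-feed mapping that you would derive from your own CRUSHMap, and the script reads crushtool output on stdin:

import re
import sys

# Hypothetical mapping; derive the real one from your CRUSHMap.
# It must cover every OSD id that crushtool can emit.
OSD_TO_FEED = {36: "a", 68: "b", 18: "c", 21: "a", 52: "b", 67: "c"}

# Usage: crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 | python check.py
for line in sys.stdin:
    m = re.search(r"x (\d+) \[([\d,]+)\]", line)
    if not m:
        continue
    osds = [int(n) for n in m.group(2).split(",")]
    feeds = {OSD_TO_FEED[o] for o in osds}
    if len(feeds) != len(osds):
        print("x=%s: replicas %s share a power feed!" % (m.group(1), osds))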
A summary
● We cache anonymously accessed objects with Varnish
– Allows us to process thousands of requests per second
– Saves us I/O on the OSDs
● We use LogStash and ElasticSearch to store all requests and do usage accounting
● With CRUSH we store each replica on a different power circuit
Resources
● LogStash: http://www.logstash.net/
● ElasticSearch: http://www.elasticsearch.net/
● Varnish: http://www.varnish-cache.org/
● CRUSH: http://ceph.com/docs/master/
● E-mail: wido@42on.com
● Twitter: @widodh
