2. object model
● pools
– 1s to 100s
– independent namespaces or object collections
– replication level, placement policy
● objects
– bazillions
– blob of data (bytes to gigabytes)
– attributes (e.g., “version=12”; bytes to kilobytes)
– key/value bundle (bytes to gigabytes)
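The object model above can be pictured with a small in-memory sketch. This is a toy model and not the librados API: the names `Pool`, `RadosObject`, `xattrs`, `omap` and the default of 3 replicas are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class RadosObject:
    """Toy stand-in for one object: byte payload, attributes, key/value bundle."""
    data: bytearray = field(default_factory=bytearray)  # blob of data
    xattrs: dict = field(default_factory=dict)          # small named attributes
    omap: dict = field(default_factory=dict)            # key/value bundle


@dataclass
class Pool:
    """Toy stand-in for a pool: its own namespace plus per-pool policy."""
    name: str
    replicas: int = 3                                   # illustrative default only
    objects: dict = field(default_factory=dict)

    def get(self, oid):
        # Create the object on first use, as a convenience for this sketch.
        return self.objects.setdefault(oid, RadosObject())


images = Pool("images", replicas=2)
mail = Pool("mail")

# The same object name in two pools refers to two unrelated objects.
images.get("foo").data += b"hello"
images.get("foo").xattrs["version"] = b"12"
mail.get("foo").omap["msg-1"] = b"Subject: hi"
```

The point of the sketch is that the three parts of an object (data, attributes, key/value bundle) are separate, and that pools do not share names.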
3. atomic transactions
● client operations sent to the OSD cluster
– operate on a single object
– can contain a sequence of operations, e.g.
● truncate object
● write new object data
● set attribute
● atomicity
– all operations in a transaction commit together, or none do
● conditional
– 'guard' operations can control whether the transaction is performed
● verify xattr has specific value
● assert object is a specific version
– allows atomic compare-and-swap etc.
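A minimal sketch of how guarded, all-or-nothing transactions on a single object behave. This is a toy in-memory model, not the OSD implementation or the librados API: `Store`, `Obj`, `GuardFailed` and the operation helpers are hypothetical names, and the "apply to a scratch copy, then swap" approach is simply one easy way to show the semantics.

```python
import copy


class GuardFailed(Exception):
    """Raised by a guard operation to abort the whole transaction."""


class Obj:
    def __init__(self):
        self.data = b""
        self.xattrs = {}
        self.version = 0


def truncate(size):
    def op(o):
        o.data = o.data[:size]
    return op


def write(offset, buf):
    def op(o):
        padded = o.data.ljust(offset, b"\0")  # zero-fill any gap before offset
        o.data = padded[:offset] + buf + padded[offset + len(buf):]
    return op


def setxattr(name, value):
    def op(o):
        o.xattrs[name] = value
    return op


def cmpxattr(name, expected):
    def op(o):  # guard: verify xattr has a specific value
        if o.xattrs.get(name) != expected:
            raise GuardFailed(name)
    return op


def assert_version(expected):
    def op(o):  # guard: assert object is a specific version
        if o.version != expected:
            raise GuardFailed("version")
    return op


class Store:
    def __init__(self):
        self.objects = {}

    def execute(self, oid, ops):
        """Run ops against a scratch copy; publish it only if every op succeeds."""
        scratch = copy.deepcopy(self.objects.get(oid) or Obj())
        try:
            for op in ops:
                op(scratch)
        except GuardFailed:
            return False                      # nothing was applied
        scratch.version += 1
        self.objects[oid] = scratch
        return True


store = Store()
store.execute("cfg", [write(0, b"v1"), setxattr("owner", b"alice")])

# Compare-and-swap: only replace the contents while alice still owns the object.
ok = store.execute("cfg", [cmpxattr("owner", b"alice"), truncate(0),
                           write(0, b"v2"), setxattr("owner", b"bob")])

# The guard fails after the truncate, so the truncate is discarded as well.
stale = store.execute("cfg", [truncate(0), cmpxattr("owner", b"alice"),
                              write(0, b"v3")])

# Version-based guard: succeeds because two transactions have committed so far.
bumped = store.execute("cfg", [assert_version(2), write(0, b"v3")])
```

Here `ok` and `bumped` commit, while `stale` leaves the object exactly as it was, including the data that its truncate would have removed.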
4. key/value storage
● store key/value pairs in an object
– independent from object attrs or byte data payload
● based on Google's LevelDB
– efficient random and range insert/query/removal
– based on BigTable SSTable design
● exposed via key/value API
– insert, update, remove
– individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
– e.g., file system directory objects
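To illustrate why a per-object key/value API avoids read/modify/write, here is a toy sorted key/value object used as a directory. It is a sketch under stated assumptions: `KeyValueObject` and its method names are hypothetical, and a sorted Python list stands in for the LevelDB-backed store.

```python
import bisect


class KeyValueObject:
    """Toy key/value bundle with keys kept sorted for range operations."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = {}   # key -> value

    def set(self, key, value):
        # Insert or update a single key without touching the others.
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def remove(self, key):
        if key in self._vals:
            del self._vals[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, start, end):
        # Return (key, value) pairs with start <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

    def remove_range(self, start, end):
        for key, _ in self.range(start, end):
            self.remove(key)


# A directory object: file name -> inode number.
d = KeyValueObject()
for name, ino in [("a.txt", 10), ("b.txt", 11), ("c.txt", 12), ("notes/", 13)]:
    d.set(name, ino)

d.set("b.txt", 42)            # update one entry in place
d.remove("c.txt")             # remove one entry in place
listing = d.range("a", "c")   # names starting with "a" or "b"
```

Renaming or unlinking a file touches one or two keys; the client never has to fetch and rewrite a serialized directory blob.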
5. watch/notify
● establish stateful 'watch' on an object
– client interest persistently registered with object
– client keeps session to OSD open
● send 'notify' messages to all watchers
– notify message (and payload) is distributed to all watchers
– variable timeout
– notification on completion
● all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
– locking, distributed coordination (à la ZooKeeper), etc.
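The watch/notify flow can be sketched with a toy in-process model. This is not the librados watch/notify API: `WatchedObject` and its methods are hypothetical names, callbacks stand in for client sessions, and the timeout is deliberately left out.

```python
class WatchedObject:
    """Toy object that remembers its watchers and fans out notifies."""

    def __init__(self):
        self._watchers = {}   # handle -> callback
        self._next_handle = 0

    def watch(self, callback):
        # Register persistent interest; the handle identifies this watch.
        handle = self._next_handle
        self._next_handle += 1
        self._watchers[handle] = callback
        return handle

    def unwatch(self, handle):
        self._watchers.pop(handle, None)

    def notify(self, payload):
        # Deliver the payload to every watcher. A callback that returns
        # counts as an acknowledgement; the list of acks is what the
        # notifier gets back on completion.
        acked = []
        for handle, callback in list(self._watchers.items()):
            callback(payload)
            acked.append(handle)
        return acked


obj = WatchedObject()
seen_a, seen_b = [], []
a = obj.watch(seen_a.append)
b = obj.watch(seen_b.append)

acks = obj.notify(b"refresh")   # both watchers receive and acknowledge
obj.unwatch(b)
obj.notify(b"again")            # only the remaining watcher receives this
```

In a real cluster the watchers are remote clients and a notify can time out; the sketch only shows the registration, fan-out and acknowledgement steps.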
7. watch/notify example
● radosgw cache consistency
– radosgw instances watch a single object (.rgw/notify)
– locally cache bucket metadata
– on bucket metadata changes (removal, ACL changes)
● write change to relevant bucket object
● send notify with bucket name to other radosgw instances
– on receipt of notify
● invalidate relevant portion of cache
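The cache-consistency steps above can be sketched as follows. This is a simplified model, not radosgw code: `Gateway`, `NotifyChannel`, the dictionary used as the bucket store and the ACL strings are all assumptions for illustration.

```python
class NotifyChannel:
    """Stands in for the single watched object shared by all gateways."""

    def __init__(self):
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def notify(self, payload):
        for callback in self.watchers:
            callback(payload)


class Gateway:
    def __init__(self, store, channel):
        self.store = store      # shared bucket objects
        self.cache = {}         # this gateway's local metadata cache
        self.channel = channel
        channel.watch(self.on_notify)

    def get_bucket(self, name):
        # Serve from the local cache, filling it from the store on a miss.
        if name not in self.cache:
            self.cache[name] = dict(self.store[name])
        return self.cache[name]

    def set_acl(self, name, acl):
        self.store[name]["acl"] = acl   # write change to the bucket object
        self.channel.notify(name)       # tell every gateway which bucket changed

    def on_notify(self, bucket_name):
        # Invalidate only the affected portion of the cache.
        self.cache.pop(bucket_name, None)


store = {"photos": {"acl": "private"}, "logs": {"acl": "private"}}
channel = NotifyChannel()
gw1, gw2 = Gateway(store, channel), Gateway(store, channel)

before = gw2.get_bucket("photos")["acl"]   # gw2 now caches "photos"
gw2.get_bucket("logs")                     # and "logs"
gw1.set_acl("photos", "public-read")       # change made through gw1
after = gw2.get_bucket("photos")["acl"]    # gw2 re-reads after invalidation
```

Only the `photos` entry is dropped from gw2's cache; its `logs` entry stays cached.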
8. rados classes
● dynamically loaded .so
– /var/lib/rados-classes/*
– implement new object “methods” using existing methods
– part of I/O pipeline
– simple internal API
● reads
– can call existing native or class methods
– do whatever processing is appropriate
– return data
● writes
– can call existing native or class methods
– do whatever processing is appropriate
– generates a resulting transaction to be applied atomically
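A rough sketch of the read/write split for class methods. The real classes are shared libraries loaded by the OSD; this Python model only mirrors the shape described above, and `ToyOSD`, `register`, `call` and the `log` class are hypothetical names. Read methods return data, while write methods return a set of changes that the caller applies in one step.

```python
class ToyOSD:
    """Toy dispatcher: looks up a class method and runs it next to the data."""

    def __init__(self):
        self.objects = {}
        self.methods = {}

    def register(self, cls, method, fn):
        self.methods[(cls, method)] = fn

    def call(self, oid, cls, method, arg=b""):
        obj = self.objects.setdefault(oid, {"data": b"", "xattrs": {}})
        reply, changes = self.methods[(cls, method)](obj, arg)
        if changes is not None:
            # Swap in the complete result at once, so a write method's
            # updates to data and attributes become visible together.
            self.objects[oid] = {**obj, **changes}
        return reply


def count_lines(obj, arg):
    # Read method: compute a result from the object and return it.
    return len(obj["data"].splitlines()), None


def append_line(obj, arg):
    # Write method: build the new state instead of mutating the object.
    new_data = obj["data"] + arg + b"\n"
    count = str(len(new_data.splitlines())).encode()
    return None, {"data": new_data, "xattrs": {**obj["xattrs"], "lines": count}}


osd = ToyOSD()
osd.register("log", "append", append_line)
osd.register("log", "count", count_lines)

osd.call("events", "log", "append", b"one")
osd.call("events", "log", "append", b"two")
n = osd.call("events", "log", "count")
```

Because `append_line` never mutates the object directly, a failure inside the method would leave the stored object unchanged.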
9. class examples
● grep
– read an object, select the matching records, and return only those
● sha1
– read object, generate fingerprint, return that
● images
– rotate, resize, crop image stored in object
– remove red-eye
● crypto
– encrypt/decrypt object data with provided key
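Two of these examples are small enough to sketch as plain functions over an object's bytes. The function names and the newline-delimited record format are assumptions; in a real class the same logic would run on the OSD so that only the result crosses the network.

```python
import hashlib


def grep_method(data: bytes, pattern: bytes) -> bytes:
    """Return only the newline-delimited records that contain pattern."""
    records = data.split(b"\n")
    return b"\n".join(r for r in records if pattern in r)


def sha1_method(data: bytes) -> str:
    """Return the SHA-1 fingerprint of the object data as hex."""
    return hashlib.sha1(data).hexdigest()


log = b"GET /a\nPUT /b\nGET /c"
matches = grep_method(log, b"GET")
fingerprint = sha1_method(b"abc")
```

For a large object the saving is that the client receives a few matching records or a 40-character digest instead of the whole payload.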
10. ideas
● rados mailbox (RMB?)
– plug librados backend into dovecot, postfix, etc.
– key/value object for each mailbox
● key = message id
● value = headers
– object for each message or attachment
– watch/notify for delivery notification
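One possible shape for the mailbox layout, as a toy in-memory sketch. Nothing here is an existing design or API: `Mailbox`, `deliver` and the callback-based watch are hypothetical, and dictionaries stand in for the key/value object and the per-message objects.

```python
class Mailbox:
    """Toy mailbox: a header index, one object per message, and watchers."""

    def __init__(self):
        self.index = {}      # key/value object: message id -> headers
        self.messages = {}   # one object per message body
        self.watchers = []   # callbacks standing in for watching clients

    def watch(self, callback):
        self.watchers.append(callback)

    def deliver(self, msg_id, headers, body):
        self.messages[msg_id] = body   # store the message object first
        self.index[msg_id] = headers   # then make it visible in the index
        for callback in self.watchers:
            callback(msg_id)           # delivery notification

    def list_headers(self):
        # Listing a mailbox reads only the index, never the message bodies.
        return sorted(self.index.items())


inbox = Mailbox()
new_mail = []
inbox.watch(new_mail.append)           # e.g. an IMAP session waiting for mail

inbox.deliver("<1@example>", {"Subject": "hi"}, b"first body")
inbox.deliver("<2@example>", {"Subject": "re: hi"}, b"second body")
headers = inbox.list_headers()
```

Keeping headers in the key/value index means a mail client can list a mailbox without reading any message object.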
11. ideas
● distributed key/value table
– aggregate many k/v objects into one big 'table'
– working prototype exists (thanks, Eleanor!)
12. ideas
● lua rados class
– embed lua interpreter in a rados class
– ship semi-arbitrary code for operations
● json class
– parse, manipulate json structures