2. Who am I?
● Noah Watkins
● Red Hat engineer
○ Big Data applications on Ceph
● PhD candidate at UC Santa Cruz
○ Use Ceph as a research platform
● Contact
○ noahwatkins@gmail.com
3. Today’s Agenda
● Non-technical talk about technical stuff
● Less visible projects that deserve attention
● Ceph is a big ecosystem
○ Running Hadoop on Ceph
○ Tracing and debugging features
○ Custom object interfaces
6. Big Data with Ceph and Hadoop
● Do you Hadoop?
● Are you running a Ceph cluster?
● Combined, they work. End of talk.
National System Administrator Appreciation Day
15. Why should you care? Consolidation
Diagram labels: FooStore!, FooApp!, FooApp!
16. How does it work?
● A shim layer translates file system APIs
○ CephFS <-> Hadoop Common File System
● Opens up the entire Hadoop ecosystem
○ MapReduce
○ Spark
○ Storm
○ Impala
○ HBase
○ The list goes on and on
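The shim idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the cephfs-hadoop code: the real shim targets Hadoop's Java file system API, whereas FakeCephClient, GenericFileSystem, ShimFileSystem, and the "ceph://host:port/path" URI form used here are hypothetical stand-ins chosen to keep the example self-contained.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the storage client the shim wraps.
// This is NOT the libcephfs API; it just keeps files in memory.
class FakeCephClient {
 public:
  void write_file(const std::string& path, const std::string& bytes) {
    files_[path] = bytes;
  }
  std::string read_file(const std::string& path) const {
    auto it = files_.find(path);
    if (it == files_.end()) throw std::runtime_error("ENOENT: " + path);
    return it->second;
  }
  bool exists(const std::string& path) const { return files_.count(path) != 0; }

 private:
  std::map<std::string, std::string> files_;
};

// Hypothetical stand-in for the generic interface a framework codes against.
class GenericFileSystem {
 public:
  virtual ~GenericFileSystem() = default;
  virtual void create(const std::string& uri, const std::string& bytes) = 0;
  virtual std::string open(const std::string& uri) = 0;
  virtual bool exists(const std::string& uri) = 0;
};

// The shim: translates generic calls (and URIs) into client calls.
class ShimFileSystem : public GenericFileSystem {
 public:
  explicit ShimFileSystem(FakeCephClient& client) : client_(client) {}

  void create(const std::string& uri, const std::string& bytes) override {
    client_.write_file(to_path(uri), bytes);
  }
  std::string open(const std::string& uri) override {
    return client_.read_file(to_path(uri));
  }
  bool exists(const std::string& uri) override {
    return client_.exists(to_path(uri));
  }

 private:
  // Strip an assumed "ceph://host:port" prefix; plain paths pass through.
  static std::string to_path(const std::string& uri) {
    const std::string scheme = "ceph://";
    if (uri.compare(0, scheme.size(), scheme) != 0) return uri;
    const std::size_t slash = uri.find('/', scheme.size());
    return slash == std::string::npos ? "/" : uri.substr(slash);
  }

  FakeCephClient& client_;
};
```

Code written against GenericFileSystem never sees the client underneath, which is why everything built on the generic interface comes along for free once one shim exists.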
17. HDFS vs CephFS, 1TB Terasort
http://www.mellanox.com/related-docs/whitepapers/wp_hadoop_on_cephfs.pdf
1-year-old results!
18. What Works and What Doesn’t (yet)
● Locality-aware scheduling
○ The rumors aren’t true :)
● Variable replication and erasure coding
○ Select from existing pools
● Snapshots
● How’s the stability?
○ Terasort, HBase, DFSIO
○ Bug fixes and performance tuning (HDFS isn’t strict!)
○ Gets better with each new MDS update
19. Test driving Hadoop on Ceph
● GitHub
○ https://github.com/ceph/cephfs-hadoop
● Tutorial
○ http://ceph.com/docs/master/cephfs/hadoop/
○ Streamlined installation and new docs coming soon!
● Mailing list
○ Best resource right now
○ http://tracker.ceph.com
32. Trace processing pipeline
● Processing step examines trace events
● Typically written in Python
● Looking for pairs is a common pattern
○ Time spent in queue
○ Time spent in I/O
○ Client processing time
● Requires knowledge of internal workings
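The "looking for pairs" pattern can be sketched as follows. This is a minimal illustration, not Ceph's tooling: the Event struct, the request_id field, and event names such as "queue_enter" and "queue_exit" are assumptions made up for the example. Real LTTng events carry different fields, and, as noted above, real processing steps are typically written in Python.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified trace event; real LTTng events carry more fields.
struct Event {
  uint64_t timestamp_ns;  // when the event fired
  uint64_t request_id;    // assumed key that ties an enter/exit pair together
  std::string name;       // e.g. "queue_enter" or "queue_exit" (made-up names)
};

// Pair each enter event with the next exit event for the same request and
// return the elapsed time per request id.
std::map<uint64_t, uint64_t> pair_durations(const std::vector<Event>& events,
                                            const std::string& enter_name,
                                            const std::string& exit_name) {
  std::map<uint64_t, uint64_t> start;     // request id -> enter timestamp
  std::map<uint64_t, uint64_t> duration;  // request id -> exit minus enter
  for (const Event& ev : events) {
    if (ev.name == enter_name) {
      start[ev.request_id] = ev.timestamp_ns;
    } else if (ev.name == exit_name) {
      auto it = start.find(ev.request_id);
      if (it == start.end())
        continue;  // exit without a matching enter: skip it
      duration[ev.request_id] = ev.timestamp_ns - it->second;
      start.erase(it);
    }
  }
  return duration;
}
```

The same function covers time in queue, time in I/O, or client processing time; only the pair of event names changes. Knowing which names form a meaningful pair is where the knowledge of internal workings comes in.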
33. Zipkin, Blkin, and LTTng
● Dapper is a Google system
○ Traces causal relationships
● Zipkin implemented by Twitter
○ Look at the pretty GUI
○ Ignores data sources
● Huge number of raw LTTng tracepoints in Ceph
○ LTTng → Zipkin (Blkin)
■ Marios Kogias
○ Andrew Shewmaker
○ Adam Crume
34. Getting started with tracing!
● Lots of tracepoints exist!
○ Adding new points is easy :)
● RBD-Replay
○ Collect and replay RBD traces
○ http://ceph.com/docs/master/rbd/rbd-replay/
● Adding points and discussion
○ http://noahdesu.github.io/2014/06/01/tracing-ceph-with-lttng-ust.html
TRACEPOINT_EVENT(librados, rados_write_enter,
    TP_ARGS(
        rados_ioctx_t, ioctx,
        const char*, oid,
        const void*, buf,
        size_t, len,
        uint64_t, off),
    TP_FIELDS(
        ctf_integer_hex(rados_ioctx_t, ioctx, ioctx)
        ctf_string(oid, oid)
    )
)
36. A different version of a better talk
● The objects in RADOS can have arbitrary code associated with them
○ Think: “remotely compress object ‘foo’, please.”
● “Distributed Storage and Compute with Ceph’s librados”
○ Great talk by Sage Weil
○ Check out YouTube
● Scripting compute with librados
37. How does an OSD handle a request?
Diagram: Client (librados) → OSD; read-object(foo) runs as a read-object operation inside a transaction
● Client reads an object from an OSD
● The OSD executes a “read” operation
○ The read operation knows how to access data managed by the OSD
● All operations are executed in a transactional context
● Exact function can be swapped out
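One way to picture "the exact function can be swapped out" is a table that maps operation names to functions. The sketch below is a toy model under stated assumptions, not how the OSD is implemented: ToyOsd, Operation, put, register_op, and handle are hypothetical names, objects are plain strings, and the transactional context is left out entirely.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model: an operation receives the object's bytes and returns a reply.
// None of these names come from Ceph.
using Operation = std::function<std::string(const std::string& object_data)>;

class ToyOsd {
 public:
  // Store an object under the given id.
  void put(const std::string& oid, const std::string& data) {
    objects_[oid] = data;
  }

  // Install (or replace) the function that runs for an operation name.
  void register_op(const std::string& name, Operation op) {
    ops_[name] = std::move(op);
  }

  // Handle a request: find the object, find the operation, run it.
  std::string handle(const std::string& op_name, const std::string& oid) const {
    auto obj = objects_.find(oid);
    if (obj == objects_.end()) throw std::out_of_range("no such object");
    auto op = ops_.find(op_name);
    if (op == ops_.end()) throw std::out_of_range("no such operation");
    return op->second(obj->second);
  }

 private:
  std::map<std::string, std::string> objects_;
  std::map<std::string, Operation> ops_;
};
```

Registering a different function under the same name changes what a request does without touching the request path. Object classes apply the same idea with code that is installed into the OSD.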
38. What are RADOS object classes?
Diagram: Client (librados) → OSD; get-md5(foo) runs the MD5-Hash plugin (installed out of band) inside a transaction
● Client writes some C++ code
● Compiles it into an OSD plugin
● Once installed, this code can be invoked by clients
● Avoids data transfer
○ Can cache results
● Can simplify application design
39. Example RADOS object class plugin
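int compute_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  // Stat the object to query its size (cls_cxx_stat takes a uint64_t*).
  uint64_t size;
  int ret = cls_cxx_stat(hctx, &size, NULL);
  if (ret < 0)
    return ret;

  // Read the entire object into a buffer (cls_cxx_read takes a bufferlist*).
  bufferlist data;
  ret = cls_cxx_read(hctx, 0, size, &data);
  if (ret < 0)
    return ret;

  // Hash the buffer with Crypto++ MD5; the digest is MD5::DIGESTSIZE bytes.
  byte digest[MD5::DIGESTSIZE];
  MD5().CalculateDigest(digest, (byte*)data.c_str(), data.length());

  // Return the digest to the client.
  out->append((const char*)digest, sizeof(digest));
  return 0;
}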
45. Example RADOS object class plugin
● Input provided by client, and output returned to client.
● Stat the object to query its size.
● Read the entire object into a buffer.
● Pass this data buffer to the MD5 algorithm.
● Return the MD5 digest to the client.
● All in a transactional context.
46. Dynamic object classes with Lua
Diagram: Client (librados) → OSD; call-lua(script, foo) runs in a LuaJIT VM with a dynamically generated interface, inside a transaction
● Lua is great as an embedded language
● LuaJIT is a high-performance implementation
● Allow clients to construct and modify object classes without compiling or restarting OSDs
47. Example: Lua Thumbnail Generator
function thumb(input, output)
  -- apply thumbnail spec to original image
  local spec_string = input:str()
  local blob = get_orig_img()
  local img = assert(magick.load_image_from_blob(blob:str()))
  img = magick.thumb(img, spec_string)

  -- append thumbnail to object
  local obj_size = cls.stat()
  local img_bl = bufferlist.new()
  img_bl:append(img)
  cls.write(obj_size, #img_bl, img_bl)

  -- save location in leveldb
  local loc_spec = #img_bl .. "@" .. obj_size
  local loc_spec_bl = bufferlist.new()
  loc_spec_bl:append(loc_spec)
  cls.map_set_val(spec_string, loc_spec_bl)
end
Diagram: object layout is Original, Ver.1, Ver.2, Ver.3, plus a Thumbnail Index; together they form an app-specific object interface
● Read object and apply ImageMagick transformation
● Append (cache) the new version of the image to the object
● Save the location of the version indexed by its specification
● Write a smart read function to consult the cache
● Application can dynamically alter the transformation applied
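The "smart read" that consults the cache is not shown on the slide, so here is one possible sketch. It assumes the index layout written by the Lua handler above, where each specification maps to a "length@offset" string. Everything else is hypothetical: ToyObject, cache_thumbnail, and read_cached are illustrative names, the object is an in-memory string, and the specification strings are made up.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy object: a byte string plus a key/value index, standing in for the
// object data and the per-object map that the Lua handler writes to.
struct ToyObject {
  std::string data;
  std::map<std::string, std::string> index;  // spec -> "length@offset"
};

// Append a thumbnail and record where it lives, mirroring the Lua handler.
void cache_thumbnail(ToyObject& obj, const std::string& spec,
                     const std::string& thumb) {
  const std::size_t offset = obj.data.size();
  obj.data += thumb;
  obj.index[spec] = std::to_string(thumb.size()) + "@" + std::to_string(offset);
}

// Smart read: return true and fill *out if a cached version exists.
bool read_cached(const ToyObject& obj, const std::string& spec,
                 std::string* out) {
  auto it = obj.index.find(spec);
  if (it == obj.index.end())
    return false;  // cache miss: the caller has to generate the thumbnail
  const std::string& loc = it->second;
  const std::size_t at = loc.find('@');
  if (at == std::string::npos)
    return false;  // malformed index entry
  const std::size_t length = std::stoul(loc.substr(0, at));
  const std::size_t offset = std::stoul(loc.substr(at + 1));
  if (offset > obj.data.size() || length > obj.data.size() - offset)
    return false;  // entry points outside the object
  *out = obj.data.substr(offset, length);
  return true;
}
```

On a miss, the caller would run the thumbnail handler once and retry. After that, every read for the same specification is served from the bytes already appended to the object.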
52. Getting started with scripted RADOS
● Buyer beware!
○ Experimental code
○ Works and is fairly stable
● Code available on GitHub
○ http://github.com/ceph/ceph
○ branch: cls-lua
● In-depth explanation and examples
○ http://ceph.com/rados/dynamic-object-interfaces-with-lua/
53. That’s it!
● Lots of interesting development
● Ceph is a great platform for experimentation
● Q&A