4. WHAT
• “Kafka is a messaging system that was
originally developed at LinkedIn to
serve as the foundation for LinkedIn's
activity stream and operational data
processing pipeline.”
Friday, July 5, 2013
5. Use cases
• Operational monitoring: real-time, heads-up
monitoring
• Reporting and batch processing: load data into
a data warehouse or Hadoop system
42. Drawbacks
• All disk reads and writes go through this
unified cache. The feature cannot easily be
turned off without using direct I/O, so even if
a process maintains an in-process cache of the
data, that data will likely be duplicated in the
OS pagecache, effectively storing everything
twice.
45. If we use memory (JVM)
• The memory overhead of objects is very
high, often doubling the size of the data
stored (or worse).
• Java garbage collection becomes
increasingly sketchy and expensive as
the in-heap data increases.
46. Cache size
• Using the pagecache at least doubles the
available cache (it automatically has
access to all free memory), and likely
doubles it again by storing a compact byte
structure rather than individual objects.
This results in a cache of up to 28-30GB
on a 32GB machine.
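The back-of-envelope arithmetic above can be made concrete. The heap split and the 2x object-overhead factor below are illustrative assumptions, not measured numbers:

```python
total_ram_gb = 32

# In-process (JVM) cache: suppose roughly half the machine is given to
# the heap, and object overhead roughly doubles the size of stored data.
heap_gb = total_ram_gb / 2     # ~16 GB heap (assumed split)
jvm_cache_gb = heap_gb / 2     # ~8 GB of actual cached data

# Pagecache: automatic access to nearly all free memory, storing
# compact bytes rather than objects.
pagecache_gb = total_ram_gb - 3  # ~29 GB, in line with the 28-30GB claim
```

Under these assumptions the pagecache holds roughly 3-4x more data than an in-heap cache on the same box.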
48. Conclusion
• using the filesystem and relying on
pagecache is superior to maintaining an
in-memory cache or other structure
49. Go Extreme!
• Write to the filesystem DIRECTLY!
• (In effect this just means the data is
transferred into the kernel's pagecache,
where the OS can flush it later.)
50. Furthermore
• You can configure a flush every N messages
or every M seconds, to put a bound on
the amount of data "at risk" in the event
of a hard crash.
• Varnish uses a pagecache-centric design as
well.
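In Kafka's broker configuration these bounds are exposed roughly as follows (property names as in Kafka's server config; the values are only illustrative):

```properties
# Flush the log to disk after this many messages...
log.flush.interval.messages=10000
# ...or after this many milliseconds, whichever comes first
log.flush.interval.ms=1000
```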
55. BTrees on disk
• Disk seeks come at 10 ms a pop
• Each disk can do only one seek at a time,
so parallelism is limited
• The observed performance of tree
structures is often super-linear in data size
56. Locking
• Page or row locking is needed to avoid
locking the whole tree
57. Two facts
• No advantage from higher drive density,
because of the heavy reliance on disk seeks
• You need small (< 100GB) high-RPM SAS
drives to maintain a sane ratio of data
to seek capacity
59. Features
• One queue is one log file
• Operations are O(1)
• Reads do not block writes or each other
• Performance is decoupled from data size
• Messages are retained after consumption
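A minimal sketch of these properties, using an in-memory list as a stand-in for the log file (names are illustrative, not Kafka's API):

```python
class Log:
    """One queue backed by one append-only log."""

    def __init__(self):
        self._messages = []  # stand-in for the on-disk log file

    def append(self, msg: bytes) -> int:
        self._messages.append(msg)       # O(1) append-write
        return len(self._messages) - 1   # offset of the new message

    def read(self, offset: int) -> bytes:
        # O(1) read; messages are retained, not removed, on consumption
        return self._messages[offset]
```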
61. 1. The operating system reads data from the disk
into pagecache in kernel space
2. The application reads the data from kernel
space into a user-space buffer
3. The application writes the data back into
kernel space into a socket buffer
4. The operating system copies the data from the
socket buffer to the NIC buffer where it is sent
over the network
62. Zero-copy
• data is copied into pagecache exactly
once and reused on each consumption,
instead of being stored in memory and
copied out to user space every time it
is read
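On Linux this collapsed path is the sendfile system call. A sketch with Python's `os.sendfile` over a local socket pair (Linux-specific; the file contents are just for the demo):

```python
import os
import socket
import tempfile

# Create a file to play the role of a log segment.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello kafka")
    path = f.name

left, right = socket.socketpair()
with open(path, "rb") as segment:
    # sendfile() moves bytes pagecache -> socket buffer inside the
    # kernel, skipping the two user-space copies (steps 2 and 3 above).
    os.sendfile(left.fileno(), segment.fileno(), 0, 11)
left.close()

received = right.recv(1024)
right.close()
os.unlink(path)
```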
67. Key points
• End-to-end: compressed by producers and
decompressed by consumers
• Batch: compression operates on a whole
"message set", not individual messages
• Kafka supports the GZIP and Snappy
codecs
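A sketch of batch, end-to-end compression with gzip (the newline framing is an illustration, not Kafka's wire format):

```python
import gzip

# Producer side: compress the whole message set as one unit, so the
# codec can exploit redundancy across messages.
messages = [b"click:user1", b"click:user2", b"click:user3"]
message_set = b"\n".join(messages)
wire = gzip.compress(message_set)

# The broker stores `wire` as-is; only the consumer decompresses.
restored = gzip.decompress(wire).split(b"\n")
```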
77. Msg Format
• N byte message:
• If magic byte is 0
1. 1 byte "magic" identifier to allow format changes
2. 4 byte CRC32 of the payload
3. N - 5 byte payload
• If magic byte is 1
1. 1 byte "magic" identifier to allow format changes
2. 1 byte "attributes" identifier to allow annotations on the message independent of the
version (e.g. compression enabled, type of codec used)
3. 4 byte CRC32 of the payload
4. N - 6 byte payload
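The two layouts can be sketched with `struct` and CRC32 (a toy encoder/decoder, not Kafka's actual serialization code):

```python
import struct
import zlib

def encode_v0(payload: bytes) -> bytes:
    # magic(1) + crc(4) + payload  ->  N - 5 byte payload
    return struct.pack(">BI", 0, zlib.crc32(payload)) + payload

def encode_v1(payload: bytes, attributes: int = 0) -> bytes:
    # magic(1) + attributes(1) + crc(4) + payload  ->  N - 6 byte payload
    return struct.pack(">BBI", 1, attributes, zlib.crc32(payload)) + payload

def decode(msg: bytes) -> bytes:
    magic = msg[0]
    if magic == 0:
        (crc,) = struct.unpack_from(">I", msg, 1)
        payload = msg[5:]   # N - 5 byte payload
    else:
        (crc,) = struct.unpack_from(">I", msg, 2)
        payload = msg[6:]   # N - 6 byte payload
    if crc != zlib.crc32(payload):
        raise ValueError("CRC mismatch: corrupt message")
    return payload
```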
78. Log format on-disk
• On-disk format of a message
• message length : 4 bytes (value: 1+4+n)
• ‘magic’ value : 1 byte
• crc : 4 bytes
• payload : n bytes
• the offset, together with the node id and
partition id, uniquely identifies a message
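Writing and re-reading this on-disk layout, using an in-memory buffer in place of a real segment file (illustrative only):

```python
import io
import struct
import zlib

def append_message(log, payload: bytes) -> None:
    # length(4) = 1 (magic) + 4 (crc) + n, then magic, crc, payload
    header = struct.pack(">IBI", 1 + 4 + len(payload), 0, zlib.crc32(payload))
    log.write(header + payload)

def read_messages(log):
    log.seek(0)
    while True:
        raw = log.read(4)
        if not raw:
            break
        (length,) = struct.unpack(">I", raw)
        magic, crc = struct.unpack(">BI", log.read(5))
        payload = log.read(length - 5)
        if crc != zlib.crc32(payload):
            raise ValueError("corrupt message")
        yield payload

segment = io.BytesIO()
append_message(segment, b"first")
append_message(segment, b"second")
```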
83. Writes
• Append-only writes
• When to flush:
• M : after every M messages
• S : after S seconds since the last flush
• Durability guarantee: losing at most M
messages or S seconds of data in the
event of a system crash
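The flush criteria can be sketched as a small policy object (illustrative names, not Kafka's implementation):

```python
import time

class FlushPolicy:
    """Flush after M unflushed messages or S seconds, whichever first."""

    def __init__(self, m: int, s: float):
        self.m = m
        self.s = s
        self.unflushed = 0
        self.last_flush = time.monotonic()

    def record_append(self) -> bool:
        # Returns True when the log file should be flushed to disk.
        self.unflushed += 1
        elapsed = time.monotonic() - self.last_flush
        return self.unflushed >= self.m or elapsed >= self.s

    def record_flush(self) -> None:
        self.unflushed = 0
        self.last_flush = time.monotonic()
```

At most M messages (or S seconds) of appends can sit unflushed, which is exactly the bound on data "at risk" in a hard crash.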
85. Buffered reads
• The read buffer size doubles automatically
• You can specify the max buffer size
86. Offset Search
• Search steps:
1. locating the log segment file in which
the data is stored
2. calculating the file-specific offset from
the global offset value
3. reading from that file offset
• A simple binary search against an
in-memory range index
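The three steps map naturally onto a binary search over segment base offsets (hypothetical offsets; each segment is keyed by the global offset of its first message):

```python
import bisect

# In-memory range index: the global offset at which each segment starts.
segment_base_offsets = [0, 350, 720, 1400]

def locate(global_offset: int):
    # 1. binary-search for the segment file holding the offset
    i = bisect.bisect_right(segment_base_offsets, global_offset) - 1
    base = segment_base_offsets[i]
    # 2. the file-specific offset is the remainder;
    # 3. the caller then reads from that position in the file
    return base, global_offset - base
```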
89. Deletes
• Policy: messages older than N days, or a
log larger than N GB
• Deleting while reading?
• a copy-on-write style segment list
implementation that provides
consistent views to allow a binary
search to proceed on an immutable
static snapshot view of the log
segments
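A copy-on-write segment list can be sketched with an immutable tuple that writers swap out atomically (illustrative, not Kafka's code):

```python
class SegmentList:
    def __init__(self, base_offsets):
        self._segments = tuple(base_offsets)  # immutable snapshot

    def view(self):
        # Readers binary-search this static snapshot; a concurrent
        # delete never mutates it, only replaces the reference.
        return self._segments

    def delete_oldest(self):
        self._segments = self._segments[1:]  # swap in a new snapshot
```

A reader that grabbed `view()` before a delete keeps a consistent, immutable picture of the log for the duration of its search.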