4.
WHAT
• “Kafka is a messaging system that was
originally developed at LinkedIn to
serve as the foundation for LinkedIn's
activity stream and operational data
processing pipeline.”
5.
Use cases
• Operational monitoring: real-time, heads-up
monitoring
• Reporting and Batch processing: load data into
a data warehouse or Hadoop system
42.
Drawbacks
• All disk reads and writes will go through this
unified cache. This feature cannot easily be
turned off without using direct I/O, so even if
a process maintains an in-process cache of the
data, this data will likely be duplicated in OS
pagecache, effectively storing everything
twice.
45.
If we use memory (JVM)
• The memory overhead of objects is very
high, often doubling the size of the data
stored (or worse).
• Java garbage collection becomes
increasingly sketchy and expensive as
the in-heap data increases.
46.
cache size
• Using the pagecache at least doubles the available cache by giving automatic access to all free memory, and likely doubles it again by storing a compact byte structure rather than individual objects
• Doing so will result in a cache of up to 28-30GB on a 32GB machine
47.
comparison
• GC: in-disk has no GC; in-memory suffers stop-the-world GC pauses
• Initialization: in-disk stays warm even after a restart; in-memory is rebuilt slowly (10 min for 10GB) and starts cold
• Logic: in-disk is handled by the OS; in-memory is handled by the program
48.
Conclusion
• using the filesystem and relying on
pagecache is superior to maintaining an
in-memory cache or other structure
49.
Go Extreme!
• Write to filesystem DIRECTLY!
• (In effect this just means that it is transferred
into the kernel's pagecache where the OS
can flush it later.)
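A minimal sketch of what "writing directly to the filesystem" looks like from the JVM (plain Java NIO, not Kafka's own code; the file name is illustrative). The write() call returns once the bytes sit in the kernel's pagecache; force() is only needed when an explicit flush to disk is wanted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PagecacheWrite {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(
                Path.of("demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {

            // write() returns as soon as the bytes are in the kernel's pagecache;
            // nothing has forced them onto the physical disk yet.
            log.write(ByteBuffer.wrap("a message\n".getBytes()));

            // An explicit flush (fsync) is only needed to bound the data "at risk";
            // otherwise the OS flushes the pagecache in the background.
            log.force(false);
        }
    }
}
```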
50.
Furthermore
• You can configure a flush every N messages or every M seconds, to put a bound on the amount of data "at risk" in the event of a hard crash
• Varnish uses a pagecache-centric design as well
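As a hedged illustration of that bound, later Kafka broker versions expose it through properties along the lines of log.flush.interval.messages and log.flush.interval.ms (exact names vary by version, and the values here are made up):

```java
import java.util.Properties;

public class FlushConfig {
    public static void main(String[] args) {
        Properties broker = new Properties();
        // Flush the log to disk after every 10_000 appended messages ...
        broker.setProperty("log.flush.interval.messages", "10000");
        // ... or after at most 1000 ms since the last flush, whichever comes first.
        broker.setProperty("log.flush.interval.ms", "1000");
        // At most 10_000 messages or 1 second of data is "at risk" on a hard crash.
        broker.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```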
55.
BTree for Disk
• Disk seeks come at 10 ms a pop
• Each disk can do only one seek at a time, so parallelism is limited
• Because fast cached operations are mixed with slow disk seeks, the observed performance of tree structures is often super-linear as data grows with a fixed cache
56.
Lock
• Page or row locking is needed to avoid locking the whole tree on each operation
57.
Two Facts
• No advantage from drive density improvements, because of the heavy reliance on disk seeks
• Need small (< 100GB) high-RPM SAS drives to maintain a sane ratio of data to seek capacity
59.
Features
• One queue is one log file
• Operations are O(1)
• Reads do not block writes or each other
• Performance is decoupled from data size
• Messages are retained after consumption
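A minimal sketch of the "one queue = one log file" idea in plain Java NIO (illustrative, not Kafka's implementation): appends always go to the end of the file regardless of its size, and positional reads neither block appends nor remove messages.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Minimal sketch of a per-queue append-only log. */
public class SimpleLog {
    private final FileChannel writer;   // appends only
    private final FileChannel reader;   // positional reads, independent of the writer

    public SimpleLog(Path file) throws IOException {
        this.writer = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.reader = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** O(1): append at the end, independent of how much data the log already holds. */
    public long append(byte[] payload) throws IOException {
        long offset = writer.size();          // byte offset doubles as the message id
        writer.write(ByteBuffer.wrap(payload));
        return offset;
    }

    /** O(1): positional read by byte offset; does not block concurrent appends
     *  and does not consume the message, so it stays available after reading. */
    public byte[] read(long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        reader.read(buf, offset);
        return buf.array();
    }
}
```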
61.
1. The operating system reads data from the disk
into pagecache in kernel space
2. The application reads the data from kernel
space into a user-space buffer
3. The application writes the data back into
kernel space into a socket buffer
4. The operating system copies the data from the
socket buffer to the NIC buffer where it is sent
over the network
62.
zerocopy
• data is copied into pagecache exactly
once and reused on each consumption
instead of being stored in memory and
copied out to kernel space every time it
is read
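In Java this is typically done with FileChannel.transferTo, which maps to sendfile(2) on Linux, replacing the four-step copy path on the previous slide. A minimal sketch (host, port and file path are illustrative):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    /** Send a chunk of the log straight from pagecache to the socket. */
    static void sendChunk(Path logFile, long offset, long length, String host, int port)
            throws IOException {
        try (FileChannel log = FileChannel.open(logFile, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            long sent = 0;
            // transferTo hands the bytes from pagecache to the NIC buffer without
            // ever copying them into a user-space buffer.
            while (sent < length) {
                sent += log.transferTo(offset + sent, length - sent, socket);
            }
        }
    }
}
```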
67.
Key point
• End-to-end: messages are compressed by the producer and decompressed by the consumer
• Batch: compression is applied to a whole "message set", not to individual messages
• Kafka supports the GZIP and Snappy codecs
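A rough sketch of end-to-end batch compression using the JDK's GZIP classes (illustrative only; Kafka wraps the compressed bytes in its own message-set framing):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BatchCompression {
    /** Producer side: compress a whole message set at once (better ratio than per message). */
    static byte[] compressBatch(List<String> messages) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            for (String m : messages) {
                gzip.write((m + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return bytes.toByteArray();   // stored and shipped compressed, end to end
    }

    /** Consumer side: only the consumer decompresses; the broker just stores the batch. */
    static String decompressBatch(byte[] batch) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(batch))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```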
77.
Msg Format
• N byte message:
• If magic byte is 0
1. 1 byte "magic" identifier to allow format changes
2. 4 byte CRC32 of the payload
3. N - 5 byte payload
• If magic byte is 1
1. 1 byte "magic" identifier to allow format changes
2. 1 byte "attributes" identifier to allow annotations on the message independent of the
version (e.g. compression enabled, type of codec used)
3. 4 byte CRC32 of the payload
4. N - 6 byte payload
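A hedged sketch of parsing that layout in plain Java, assuming the buffer holds exactly one N-byte message (this mirrors the slide, not Kafka's actual decoder):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Illustrative parser for the message layout above (magic 0 or 1). */
public class MessageParser {
    static byte[] parse(ByteBuffer msg) {               // msg holds one N-byte message
        byte magic = msg.get();                         // 1 byte "magic" format version
        if (magic == 1) {
            byte attributes = msg.get();                // 1 byte: e.g. compression codec; ignored here
        }
        long expectedCrc = msg.getInt() & 0xFFFFFFFFL;  // 4 byte CRC32 of the payload
        byte[] payload = new byte[msg.remaining()];     // N-5 (magic 0) or N-6 (magic 1) bytes
        msg.get(payload);

        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expectedCrc) {
            throw new IllegalStateException("CRC mismatch: message corrupted");
        }
        return payload;
    }
}
```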
78.
Log format on-disk
• On-disk format of a message
• message length : 4 bytes (value: 1+4+n)
• ‘magic’ value : 1 byte
• crc : 4 bytes
• payload : n bytes
• The offset, together with the partition id and node id, uniquely identifies a message
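A sketch of appending one such on-disk entry for the magic = 0 case (illustrative; the length and CRC cover exactly the fields listed above):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

/** Sketch of appending one on-disk entry: length | magic | crc | payload. */
public class LogEntryWriter {
    static void append(FileChannel log, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);

        ByteBuffer entry = ByteBuffer.allocate(4 + 1 + 4 + payload.length);
        entry.putInt(1 + 4 + payload.length);    // message length : 4 bytes (value: 1+4+n)
        entry.put((byte) 0);                     // 'magic' value  : 1 byte
        entry.putInt((int) crc.getValue());      // crc            : 4 bytes
        entry.put(payload);                      // payload        : n bytes
        entry.flip();
        log.write(entry);
    }
}
```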
83.
Writes
• Append-only writes
• When to flush:
• M : after M messages have been written to the log file
• S : S seconds after the last flush
• Durability guarantee: losing at most M messages or S seconds of data in the event of a system crash
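A minimal sketch of that policy, with M and S as constructor parameters (illustrative, not Kafka's log code):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

/** Sketch of the flush policy: fsync after M messages or S seconds, whichever comes first. */
public class FlushPolicy {
    private final int maxUnflushedMessages;     // M
    private final long maxUnflushedMillis;      // S * 1000
    private int unflushed = 0;
    private long lastFlush = System.currentTimeMillis();

    FlushPolicy(int m, long sMillis) {
        this.maxUnflushedMessages = m;
        this.maxUnflushedMillis = sMillis;
    }

    /** Called after every append; at most M messages or S seconds of data can be lost. */
    void maybeFlush(FileChannel log) throws IOException {
        unflushed++;
        long now = System.currentTimeMillis();
        if (unflushed >= maxUnflushedMessages || now - lastFlush >= maxUnflushedMillis) {
            log.force(false);                   // push pagecache contents to disk
            unflushed = 0;
            lastFlush = now;
        }
    }
}
```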
85.
Buffer Reads
• The read buffer size is doubled automatically until a complete message fits
• You can specify the maximum buffer size
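A sketch of the doubling behaviour, assuming the 4-byte length prefix from the on-disk format above (names and logic are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Sketch: retry a fetch with a doubled buffer until one whole message fits. */
public class GrowingReadBuffer {
    static ByteBuffer fetch(FileChannel log, long offset, int initialSize, int maxSize)
            throws IOException {
        int size = initialSize;
        while (true) {
            ByteBuffer buf = ByteBuffer.allocate(size);
            log.read(buf, offset);
            buf.flip();
            // The first 4 bytes are the message length (see the on-disk format slide).
            if (buf.remaining() >= 4 && buf.remaining() >= 4 + buf.getInt(0)) {
                return buf;                       // a complete message fits
            }
            if (size >= maxSize) {
                throw new IOException("message larger than the configured max buffer size");
            }
            size = Math.min(size * 2, maxSize);   // double and re-read
        }
    }
}
```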
86.
Offset Search
• Search steps:
1. locating the log segment file in which
the data is stored
2. calculating the file-specific offset from
the global offset value
3. reading from that file offset
• A simple binary search against an in-memory range kept for each file
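A sketch of those three steps, assuming an in-memory sorted array of segment base offsets (illustrative, not Kafka's index code):

```java
import java.util.Arrays;

/** Sketch of offset lookup: binary-search segment base offsets kept in memory. */
public class OffsetIndex {
    // Base (global) offset of each segment file, sorted ascending, one entry per log file.
    private final long[] segmentBaseOffsets;

    OffsetIndex(long[] segmentBaseOffsets) {
        this.segmentBaseOffsets = segmentBaseOffsets;
    }

    /** Step 1: find the segment holding the global offset. Step 2: compute the file-local offset. */
    long[] locate(long globalOffset) {
        int idx = Arrays.binarySearch(segmentBaseOffsets, globalOffset);
        if (idx < 0) {
            idx = -idx - 2;        // insertion point - 1: last segment starting at or below the offset
        }
        if (idx < 0) {
            // No segment can contain this offset (cf. OffsetOutOfRangeException on the next slide).
            throw new IllegalArgumentException("offset out of range");
        }
        long fileLocalOffset = globalOffset - segmentBaseOffsets[idx];
        return new long[] { idx, fileLocalOffset };   // step 3: read segment idx at this offset
    }
}
```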
87.
Features
• Reset the consumer offset
• OffsetOutOfRangeException (a problem we ran into)
89.
Deletes
• Policy: delete segments older than N days, or keep at most the most recent N GB
• What about deleting while a consumer is reading?
• a copy-on-write style segment list
implementation that provides
consistent views to allow a binary
search to proceed on an immutable
static snapshot view of the log
segments
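A sketch of the copy-on-write idea using the JDK's CopyOnWriteArrayList (illustrative, not Kafka's own implementation): mutations copy the backing array, so readers keep binary-searching a stable snapshot while the delete thread swaps old segments out.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Sketch of a copy-on-write segment list: readers see an immutable snapshot
 *  while the delete thread removes expired segments. */
public class SegmentList {
    record LogSegment(long baseOffset, String file) {}

    // Mutations copy the underlying array; in-flight readers keep the old snapshot.
    private final CopyOnWriteArrayList<LogSegment> segments = new CopyOnWriteArrayList<>();

    void append(LogSegment segment) {
        segments.add(segment);
    }

    /** Delete segments matching the retention policy without blocking readers. */
    void delete(java.util.function.Predicate<LogSegment> shouldDelete) {
        segments.removeIf(shouldDelete);   // builds a new array; concurrent reads are unaffected
    }

    /** Readers take a stable snapshot and binary-search it for the wanted offset. */
    List<LogSegment> snapshot() {
        return List.copyOf(segments);
    }
}
```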