Tokyo Series Introduction (Part 1)
Tokyo Series Introduction (Part 1): Presentation Transcript

  • Tokyo Product Series Introduction (Part 1)
    Yang Jiandong (jdyang)
    Architecture Management Department (amo)
  • Tokyo Products
    • Tokyo Cabinet
    • database library
    • Tokyo Tyrant
    • database server
    • Tokyo Dystopia
    • full-text search engine
    • Tokyo Promenade
    • content management system
    • open source
    • released under LGPL
    • powerful, portable, practical
    • written in standard C, optimized for POSIX
    [Layer diagram: applications on top; Tyrant, Dystopia, Promenade, and custom storage in the middle; Cabinet underneath; the file system at the bottom]
  • Tokyo Cabinet - database library -
    • modern implementation of DBM
    • key/value database (stores everything in one large file)
    • no limit on the length of keys and values
    • every key and value is a variable-length sequence of bytes
    • successor of QDBM
    • C99 and POSIX compatible, using pthreads, mmap, etc.
    • written in C, with APIs for C, C++, Java, Perl, and Ruby (a minimal usage sketch follows this slide)
    • high performance
    • insert: 0.4 sec / 1M records (2,500,000 qps)
    • search: 0.33 sec / 1M records (3,000,000 qps)
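    For orientation, a minimal sketch of the C hash-database API (tchdb.h); error handling is mostly omitted and the file name is just an example:

    #include <tchdb.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
      TCHDB *hdb = tchdbnew();                              /* create the hash DB object */
      if (!tchdbopen(hdb, "casket.tch", HDBOWRITER | HDBOCREAT)) {
        fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
        return 1;
      }
      tchdbput2(hdb, "foo", "bar");                         /* store a string key/value pair */
      char *value = tchdbget2(hdb, "foo");                  /* fetch it back; the caller frees */
      if (value) { printf("%s\n", value); free(value); }
      tchdbclose(hdb);
      tchdbdel(hdb);
      return 0;
    }

    Link with -ltokyocabinet (plus -lz, -lbz2, -lpthread, -lm depending on the build).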
  • Logical structure
  • TCHDB: Hash Database
    • static hashing
    • O(1) time complexity
    • separate chaining
    • binary search tree
    • balanced by the second hash
    • free block pool
    • best-fit allocation
    • dynamic defragmentation
    • combines mmap and pwrite/pread
    • reduces the number of system calls
    • compression
    • deflate (gzip) / bzip2 / custom
    [Diagram: the bucket array, each bucket chaining a set of key/value records]
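    To make the lookup path concrete, here is an illustrative sketch of the idea (simplified names and structures, not Tokyo Cabinet's actual code): the first hash picks a bucket in O(1), and inside the bucket the second hash orders the chained records as a binary search tree:

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    /* first_hash/second_hash are stand-ins for TC's real hash functions. */
    uint64_t first_hash(const char *key);
    uint8_t  second_hash(const char *key);

    typedef struct rec {
      uint8_t hash2;                   /* second hash of this record's key */
      struct rec *left, *right;        /* tree links chained off the bucket */
      const char *key, *value;
    } rec;

    /* Insertion must use the same ordering rule: smaller hash2 goes left,
       equal or larger goes right; equal hashes are told apart by the full key. */
    const rec *lookup(rec *const *buckets, uint64_t bnum, const char *key) {
      uint64_t h1 = first_hash(key) % bnum;     /* bucket index: O(1) */
      uint8_t  h2 = second_hash(key);
      const rec *r = buckets[h1];
      while (r) {
        if (h2 < r->hash2) r = r->left;
        else if (h2 > r->hash2) r = r->right;
        else if (strcmp(key, r->key) == 0) return r;   /* found */
        else r = r->right;                      /* same hash2, different key */
      }
      return NULL;                              /* not present */
    }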
  • The process of looking up a key
  • The detailed process of a single write
    Put: the operation at the database layer
    Put implementation: the logical steps the database goes through to write one record
    Write into file: how it works at the file-system level
  • put
    Acquire method_mutex (unlimited concurrent readers, but only one writer at a time)
    Hash the key buffer to get the index of the bucket
    If hdb->async is set, flush the record pool to disk
    The second hash (the low-order 8 bits) selects one of 256 record_mutex slots (multiple readers allowed, one writer at a time); see the sketch below
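    A minimal sketch of the 1/256 record-lock sharding mentioned above (the names rmtxs/RMTXNUM are illustrative assumptions, not TC's own symbols): the low 8 bits of the second hash select one of 256 read-write locks, so writers touching different slots do not block each other:

    #include <pthread.h>
    #include <stdint.h>

    #define RMTXNUM 256                        /* number of record lock slots */
    static pthread_rwlock_t rmtxs[RMTXNUM];    /* initialise once with pthread_rwlock_init */

    /* The low-order 8 bits of the second hash choose the slot. */
    static pthread_rwlock_t *record_lock(uint8_t hash2) {
      return &rmtxs[hash2 & (RMTXNUM - 1)];
    }

    static void put_record(uint8_t hash2 /*, the record itself... */) {
      pthread_rwlock_t *lk = record_lock(hash2);
      pthread_rwlock_wrlock(lk);               /* only one writer per slot */
      /* ... write the record under this slot's lock ... */
      pthread_rwlock_unlock(lk);
    }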
  • Put implementation
    Remove the old record from the cache
    Use the bucket index to get the offset stored in the bucket
    Read the record header at the offset given by the bucket
    If it is a free record, pread(hdb->fd, offset)
    Otherwise, binary-search the bucket (by second hash) to find a free record
    Write the record body (depending on the type of the put)
    PS: mmap is used to avoid excessive system calls
  • Write it into the file
    Fill the record, in the on-disk format, into a buffer that will be written to disk in one operation
    Find a suitable free block (or merge adjacent ones) for the buffer and write it, recursing if needed
    Then update the offset stored in the owning bucket
    Free block pool mechanism: an array of free blocks (persistable); when nothing fits, append at the end of the file (see the sketch below)
    Buffer pool mechanism: same layout as the on-disk format, so it can be written directly; implemented with mmap
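    The free-block pool described above can be pictured with a small best-fit sketch (an illustrative structure, not the real fbpool code): pick the smallest free block that still fits, otherwise fall back to appending at the end of the file:

    #include <stdint.h>

    typedef struct { uint64_t off; uint32_t size; } freeblock;

    /* Returns the file offset to write at, or -1 if nothing in the pool fits
       (in which case the caller appends the record at the end of the file). */
    int64_t alloc_from_pool(freeblock *pool, int n, uint32_t need) {
      int best = -1;
      for (int i = 0; i < n; i++) {
        if (pool[i].size >= need &&
            (best < 0 || pool[i].size < pool[best].size)) best = i;   /* best fit */
      }
      if (best < 0) return -1;
      int64_t off = (int64_t)pool[best].off;
      pool[best].off  += need;                 /* shrink the chosen block */
      pool[best].size -= need;
      return off;
    }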
  • TC hash summary
    The bucket array is kept in memory via mmap, so as long as the bucket array is large enough, the first hash alone guarantees very fast reads.
    The second hash organizes the records inside a bucket as a binary search tree (note: not an AVL or red-black tree). There is no rebalancing (which would require disk I/O), so in the extreme case the tree can degenerate into a linked list; this is a deliberate trade-off.
    Free-block management similar to the Unix kernel's is used, again a trade-off between space efficiency and time efficiency.
    Several key parameters (bucket array size, in-memory buffer size) depend on how much memory is given to TC, so TC & TT only reach their full performance with enough memory.
  • Restore from ulog
    tcrdb restore (client side): lock the rdb, fill in the request packet, tcrdbsend (a client-side sketch follows this slide)
    ttserver: proc (open the log, register the callback), serstart
    epoll (covered in the following slides)
    TC ulog / ADB restore:
    1. Create a log reader (list the log files, match the timestamp, lock the specific log file)
    2. Read log records from disk into memory and fill in the struct
    3. Parse the log struct and call adbredo (which calls the underlying TCH function)
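    A hedged client-side sketch of triggering a restore through the remote API (tcrdb.h); note that the argument list of tcrdbrestore differs between Tokyo Tyrant versions, so treat the call below as an assumption and check the tcrdb.h of your release. Host, port, and ulog path are taken from the test setup later in these slides:

    #include <tcrdb.h>
    #include <stdio.h>

    int main(void) {
      TCRDB *rdb = tcrdbnew();
      if (!tcrdbopen(rdb, "10.31.1.194", 20000)) {
        fprintf(stderr, "open error: %s\n", tcrdberrmsg(tcrdbecode(rdb)));
        return 1;
      }
      /* Ask the server to replay its update logs from timestamp 0.
         ASSUMPTION: 4-argument form (path, timestamp, opts); older builds omit opts. */
      if (!tcrdbrestore(rdb, "/dong/ulog", 0, 0)) {
        fprintf(stderr, "restore error: %s\n", tcrdberrmsg(tcrdbecode(rdb)));
      }
      tcrdbclose(rdb);
      tcrdbdel(rdb);
      return 0;
    }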
  • Thread Pool Model
    First of all, the listening socket is enqueued into the epoll queue (epoll/kqueue, epoll_ctl add).
    epoll_wait returns; if the event is about the listener, accept the client connection and add the new socket to the epoll queue.
    When a client socket becomes readable, the task manager moves it from the epoll queue to the task queue (epoll_ctl del, then enqueue).
    Worker threads dequeue and do each task; if the connection is keep-alive, the socket is queued back into the epoll queue.
    A condensed sketch follows.
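    A self-contained sketch of this model (single listener, fixed worker pool, echo as a placeholder for request handling; the helper names are mine, not ttserver's):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <pthread.h>
    #include <unistd.h>

    #define MAXEVT 64
    #define QSIZE 1024
    static int epfd;

    /* A tiny blocking task queue holding readable client sockets. */
    static int taskq[QSIZE];
    static unsigned qhead, qtail;
    static pthread_mutex_t qmtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcnd = PTHREAD_COND_INITIALIZER;

    static void enqueue(int fd) {
      pthread_mutex_lock(&qmtx);
      taskq[qtail++ % QSIZE] = fd;
      pthread_cond_signal(&qcnd);
      pthread_mutex_unlock(&qmtx);
    }

    static int dequeue(void) {
      pthread_mutex_lock(&qmtx);
      while (qhead == qtail) pthread_cond_wait(&qcnd, &qmtx);
      int fd = taskq[qhead++ % QSIZE];
      pthread_mutex_unlock(&qmtx);
      return fd;
    }

    /* Worker thread: do each task, then queue the socket back (keep-alive). */
    static void *worker(void *arg) {
      (void)arg;
      for (;;) {
        int fd = dequeue();
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));   /* request handling goes here */
        if (n <= 0) { close(fd); continue; }      /* peer closed: drop the socket */
        write(fd, buf, n);                        /* echo as a placeholder reply */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);  /* queue back if keep-alive */
      }
      return NULL;
    }

    int main(void) {
      int lfd = socket(AF_INET, SOCK_STREAM, 0);
      struct sockaddr_in addr = { 0 };
      addr.sin_family = AF_INET;
      addr.sin_port = htons(20000);
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
      listen(lfd, 128);

      epfd = epoll_create1(0);
      struct epoll_event lev = { .events = EPOLLIN, .data.fd = lfd };
      epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &lev);  /* first: the listener goes into epoll */

      for (int i = 0; i < 4; i++) {               /* a small worker pool */
        pthread_t th;
        pthread_create(&th, NULL, worker, NULL);
      }

      struct epoll_event evs[MAXEVT];
      for (;;) {
        int n = epoll_wait(epfd, evs, MAXEVT, -1);
        for (int i = 0; i < n; i++) {
          int fd = evs[i].data.fd;
          if (fd == lfd) {                        /* event on the listener: accept */
            int cfd = accept(lfd, NULL, NULL);
            struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
            epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
          } else {                                    /* readable client socket:   */
            epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); /* move from the epoll queue */
            enqueue(fd);                              /* to the task queue         */
          }
        }
      }
    }

    Moving a readable socket out of epoll before handing it to a worker keeps one connection from being processed by two threads at once.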
  • ttserver
    Prepare the socket object, task object, and request object
    Log at every step (AIO can be chosen)
    The server faces the ADB layer (logically)
    Database skeleton: a wrapper around the backend engine's API, plugged in through the ADB (abstract db), transparent to the server layer
    Timeouts control the rhythm
    tcrmgr can hot-back-up a tch file to a remote machine (the backup copy cannot be written to); see the example below
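    For that hot backup, the relevant tool is tcrmgr's copy command; a hedged example (option syntax can vary by version, and the destination path is just a placeholder):

    tcrmgr copy -port 20000 10.31.1.194 /path/to/backup.tch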
  • The basic architecture of Flare
  • Stress test 1
    Configuration:
    20 threads * 1000 requests/s, 0-200 KB random payload per request; limited by the NIC (100 Mbps), about 12.5 MB/s of transfer
    CPU looks normal (around 10% waiting on I/O)
    Performance drops once memory is filled up; see the charts below
  • Chart: growth of the database file size
  • Chart: records added to the database per minute
  • Stress test 2
    20 threads * 1000 requests/s, 0-200 B random payload per request, 100 Mbps NIC
    vmstat
    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    2 0 192 1601956 95440 2357920 0 0 0 0 9145 27279 4 35 61 0
    0 3 192 1600676 95440 2359480 0 0 0 2118 4761 12918 2 16 70 12
    0 3 192 1600676 95440 2359480 0 0 0 2760 1499 1987 0 0 75 25
    0 3 192 1600676 95440 2359480 0 0 0 2790 1479 1889 0 0 80 20
    4 2 192 1600612 95440 2359480 0 0 0 3376 1852 3258 0 2 77 21
    0 3 192 1600036 95440 2360000 0 0 0 2786 2895 6415 1 6 76 17
    4 0 192 1597348 95440 2362860 0 0 0 46 9084 28138 4 34 61 1
    0 2 192 1595428 95444 2364676 0 0 0 1376 6557 19543 2 23 69 6
    0 3 192 1595364 95444 2364676 0 0 0 2672 1491 1982 0 0 68 32
    Flushing database memory to disk is managed entirely by the operating system, roughly once every 5 seconds
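    That 5-second cadence is consistent with the Linux kernel's writeback defaults rather than anything in TC/TT itself; the defaults can be checked like this (values differ per distribution and kernel):

    cat /proc/sys/vm/dirty_writeback_centisecs   # 500 centiseconds = a flusher wake-up every 5 s
    cat /proc/sys/vm/dirty_expire_centisecs      # how long dirty pages may sit before they must be written back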
  • Stress test 2: system information
    ifstat
    eth0 eth1
    KB/s in KB/s out KB/s in KB/s out
    1.74 0.29 353.97 137.44
    1.32 0.14 106.34 43.38
    1.89 0.14 1006.86 381.34
    1.19 0.14 1037.20 391.11
    1.25 0.14 1018.08 385.08
    0.88 0.14 437.06 165.80
    1.51 0.14 1.33 0.83
    1.57 0.14 0.99 0.83
    free -m
    [root@zr-24 temp]# free -m
    total used free shared buffers cached
    Mem: 4048 2520 1527 0 93 2339
    -/+ buffers/cache: 88 3959
    Swap: 4996 0 4996
    When the network adapter is not the bottleneck, memory becomes the biggest bottleneck; you can see that memory is consumed rapidly in the early phase
  • Chart: file size
  • Chart: records added per minute
  • Stress test 3: configuration
    20 threads * 1000 requests/s, 0-2 KB random payload per request; limited by the NIC (100 Mbps)
    CPU looks normal (around 10% waiting on I/O)
    ttserver -host 10.31.1.194 -port 20000 -thnum 128 -dmn -ulim 1024m -ulog /dong/ulog/ -log /dong/temp/test.log -pid /dong/temp/test.pid -sid 1 /dong/temp/test.tch#bnum=100000000#rnum=0#xmsiz=0
  • Network transfer status
    [root@zr-24 dong]# ifstat
    eth0 eth1
    KB/s in KB/s out KB/s in KB/s out
    1.49 0.62 5788.39 474.26
    1.51 0.14 4638.71 379.73
    0.94 0.14 4.82 0.76
    1.82 0.14 19.52 2.68
    0.98 0.14 5.54 0.96
    1.07 0.14 3211.95 266.02
    1.07 0.14 3446.36 286.53
    1.57 0.14 24.42 3.44
    1.19 0.14 7.72 1.24
  • Server-side system status
    [root@zr-24 dong]# vmstat 2
    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    2 0 192 2249892 98624 1685756 0 0 3 160 13 12 0 1 94 4
    3 0 192 2227684 98644 1707576 0 0 0 0 11964 26959 3 16 81 0
    0 3 192 2226596 98648 1708612 0 0 0 11050 2061 2854 0 1 73 26
    0 2 192 2226532 98648 1708612 0 0 0 9826 1571 1911 0 0 74 26
    2 3 192 2215524 98664 1720036 0 0 0 13954 6948 14245 1 9 73 16
    0 3 192 2215524 98664 1720036 0 0 0 7192 1605 1775 0 0 68 32
    0 0 192 2199972 98684 1735356 0 0 0 2654 9186 20883 2 14 71 12
    0 3 192 2193124 98692 1741588 0 0 0 5282 4744 9726 1 6 74 19
    2 2 192 2175460 98708 1758992 0 0 0 6376 9705 24344 2 15 61 22
    0 2 192 2152740 98732 1781328 0 0 0 6990 12323 29449 3 20 62 16
  • After the system has run for 3 hours (memory saturated)
    [root@zr-24 ~]# ifstat
    eth0 eth1
    KB/s in KB/s out KB/s in KB/s out
    1.91 0.49 633.42 60.83
    0.88 0.14 254.84 25.39
    0.88 0.14 3.76 0.41
    1.25 0.14 134.25 19.47
    1.07 0.14 604.90 60.62
    0.88 0.14 569.50 58.87
    1.38 0.14 266.62 26.91
    1.13 0.14 3.84 0.83
  • System status
    [root@zr-24 ~]# vmstat 2
    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 192 35876 6752 3972808 0 0 14 181 0 17 0 1 93 6
    0 2 192 31972 6756 3976964 0 0 618 2414 2930 3713 0 1 83 15
    0 1 192 37412 6720 3971540 0 0 348 2188 2712 3844 0 1 79 19
    0 1 192 29540 6724 3979336 0 0 1772 0 4018 5203 0 2 87 10
    0 2 192 37604 6716 3971024 0 0 126 5110 1699 1847 0 0 86 14
    0 1 192 37348 6716 3971024 0 0 20 3186 1574 1620 0 0 85 15
    0 2 192 30564 6732 3978288 0 0 1142 1624 4027 6108 1 3 86 10
    0 3 192 30436 6732 3978288 0 0 14 4004 1445 1308 0 0 78 22
    0 1 192 29860 6736 3978544 0 0 1052 22 5386 8675 1 5 84 11
    0 2 192 36324 6736 3972304 0 0 370 3602 2203 2690 0 1 85 14
  • Chart: records added per minute
  • Chart: database file size
  • Stress test summary
    Building with a 64-bit off_t lets a single tch file grow beyond 2 GB, up to roughly 9 EB (see the example below)
    The larger the values, the lower the read/write performance
    On the ttserver side, raise the file-descriptor limit high enough to support 20 threads * 1000 requests/s
    Local tests are bottlenecked by I/O; remote tests are bottlenecked by the network (a 100 Mbps NIC...)
    The more physical memory, the better the read/write performance (because of mmap and delayed writes)
    When physical memory is exhausted (and no swap is configured), performance drops to an unacceptable level
    The tch file itself stays small, roughly the same size as the log files (the sum of the multiple log files)
    When the disk fills up, tt does not crash; it slowly rearranges free blocks, reads still work, but all writes miss
    Evaluation of a ttserver test must start only after the system has stabilized
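    On the 64-bit off_t point: in Tokyo Cabinet this corresponds to the "large" tuning option (HDBTLARGE via tchdbtune, or opts=l in the database name); a hedged sketch based on the test-3 command line, with the other options elided:

    ttserver ... /dong/temp/test.tch#opts=l#bnum=100000000
    (in C: tchdbtune(hdb, 100000000, -1, -1, HDBTLARGE);)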
  • Impressions from reading the TT & TC source code
    I can read and understand about 90% of the low-level code; the higher-level design, on the other hand, is where I still do not fully understand why things are done the way they are.
    The testing process is actually where I learned and figured out the most.
    TC's performance and code are both of very high quality; by comparison TT's design is a bit careless (for example, memory flushing is simply handed off to the operating system, which easily makes memory the bottleneck).
    The basic design ideas of such systems are all similar; the Amazon Dynamo paper is a good reference, and the ideas carry over.
  • Still to be covered:
    Remote backup and the concrete implementation of failover
    TC & TT deployment and hands-on usage experience
    How Flare plugs in TC and makes up for TT's shortcomings
  • THANK YOU