SlideShare a Scribd company logo
Submit Search
Upload
Linux下Poll和Epoll内核源码剖析
Report
Share
Hao(Robin) Dong
Bigdata Architect at Alibaba Group
Follow
•
26 likes
•
3,222 views
1
of
19
Linux下Poll和Epoll内核源码剖析
•
26 likes
•
3,222 views
Report
Share
Download Now
Download to read offline
Technology
全文:http://donghao.org/2009/08/linuxiapolliepollaueouaeaeeio.html
Read more
Hao(Robin) Dong
Bigdata Architect at Alibaba Group
Follow
Recommended
Linux性能监控cpu内存io网络 by
Linux性能监控cpu内存io网络
lovingprince58
2.3K views
•
30 slides
Oracle dba必备技能 使用os watcher工具监控系统性能负载 by
Oracle dba必备技能 使用os watcher工具监控系统性能负载
maclean liu
895 views
•
37 slides
Coreseek/Sphinx 全文检索实践指南 by
Coreseek/Sphinx 全文检索实践指南
HonestQiao
1.1K views
•
42 slides
IoTDB Ops by
IoTDB Ops
JialinQiao
203 views
•
18 slides
Linux基础 by
Linux基础
zhuqling
29 views
•
38 slides
Install Oracle11g For Aix 5 L by
Install Oracle11g For Aix 5 L
heima911
562 views
•
13 slides
More Related Content
What's hot
Oprofile linux by
Oprofile linux
Feng Yu
3.6K views
•
16 slides
Redis 存储分片之代理服务twemproxy 测试 by
Redis 存储分片之代理服务twemproxy 测试
kaerseng
1.5K views
•
10 slides
Juniper ScreenOS 基于Policy的 by
Juniper ScreenOS 基于Policy的
mickchen
601 views
•
17 slides
Heartbeat+my sql+drbd构建高可用mysql方案 by
Heartbeat+my sql+drbd构建高可用mysql方案
cao jincheng
278 views
•
6 slides
ch9-pv1-the-extended-filesystem-family by
ch9-pv1-the-extended-filesystem-family
yushiang fu
212 views
•
43 slides
Linux by
Linux
zubin Jiang
616 views
•
42 slides
What's hot
(6)
Oprofile linux by Feng Yu
Oprofile linux
Feng Yu
•
3.6K views
Redis 存储分片之代理服务twemproxy 测试 by kaerseng
Redis 存储分片之代理服务twemproxy 测试
kaerseng
•
1.5K views
Juniper ScreenOS 基于Policy的 by mickchen
Juniper ScreenOS 基于Policy的
mickchen
•
601 views
Heartbeat+my sql+drbd构建高可用mysql方案 by cao jincheng
Heartbeat+my sql+drbd构建高可用mysql方案
cao jincheng
•
278 views
ch9-pv1-the-extended-filesystem-family by yushiang fu
ch9-pv1-the-extended-filesystem-family
yushiang fu
•
212 views
Linux by zubin Jiang
Linux
zubin Jiang
•
616 views
Viewers also liked
Epoll - from the kernel side by
Epoll - from the kernel side
llj098
17.7K views
•
26 slides
Capturando pacotes de rede no kernelspace by
Capturando pacotes de rede no kernelspace
Campus Party Brasil
876 views
•
18 slides
Data Structures used in Linux kernel by
Data Structures used in Linux kernel
assinha
6K views
•
11 slides
epoll() - The I/O Hero by
epoll() - The I/O Hero
Mohsin Hijazee
2.8K views
•
14 slides
Uma visão do mundo rails campus party 2011 - fabio akita by
Uma visão do mundo rails campus party 2011 - fabio akita
Campus Party Brasil
584 views
•
201 slides
Open Source Systems Performance by
Open Source Systems Performance
Brendan Gregg
33.4K views
•
45 slides
Viewers also liked
(20)
Epoll - from the kernel side by llj098
Epoll - from the kernel side
llj098
•
17.7K views
Capturando pacotes de rede no kernelspace by Campus Party Brasil
Capturando pacotes de rede no kernelspace
Campus Party Brasil
•
876 views
Data Structures used in Linux kernel by assinha
Data Structures used in Linux kernel
assinha
•
6K views
epoll() - The I/O Hero by Mohsin Hijazee
epoll() - The I/O Hero
Mohsin Hijazee
•
2.8K views
Uma visão do mundo rails campus party 2011 - fabio akita by Campus Party Brasil
Uma visão do mundo rails campus party 2011 - fabio akita
Campus Party Brasil
•
584 views
Open Source Systems Performance by Brendan Gregg
Open Source Systems Performance
Brendan Gregg
•
33.4K views
DevConf 2014 Kernel Networking Walkthrough by Thomas Graf
DevConf 2014 Kernel Networking Walkthrough
Thomas Graf
•
8.3K views
Browsing Linux Kernel Source by Motaz Saad
Browsing Linux Kernel Source
Motaz Saad
•
7K views
introduction to linux kernel tcp/ip ptocotol stack by monad bobo
introduction to linux kernel tcp/ip ptocotol stack
monad bobo
•
12.9K views
Socket System Calls by Avinash Varma Kalidindi
Socket System Calls
Avinash Varma Kalidindi
•
31.1K views
The linux networking architecture by hugo lu
The linux networking architecture
hugo lu
•
29.3K views
The TCP/IP Stack in the Linux Kernel by Divye Kapoor
The TCP/IP Stack in the Linux Kernel
Divye Kapoor
•
52.3K views
Basic Linux Internals by mukul bhardwaj
Basic Linux Internals
mukul bhardwaj
•
41.3K views
Linux architecture by mcganesh
Linux architecture
mcganesh
•
49.8K views
Blazing Performance with Flame Graphs by Brendan Gregg
Blazing Performance with Flame Graphs
Brendan Gregg
•
323.6K views
Linux Performance Analysis: New Tools and Old Secrets by Brendan Gregg
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg
•
603.9K views
Linux Systems Performance 2016 by Brendan Gregg
Linux Systems Performance 2016
Brendan Gregg
•
504.5K views
Broken Linux Performance Tools 2016 by Brendan Gregg
Broken Linux Performance Tools 2016
Brendan Gregg
•
822.8K views
Architecture Of The Linux Kernel by Dominique Cimafranca
Architecture Of The Linux Kernel
Dominique Cimafranca
•
26.7K views
Velocity 2015 linux perf tools by Brendan Gregg
Velocity 2015 linux perf tools
Brendan Gregg
•
1.1M views
Similar to Linux下Poll和Epoll内核源码剖析
Linux Tracing System 浅析 & eBPF框架开发经验分享 by
Linux Tracing System 浅析 & eBPF框架开发经验分享
happyagan
8 views
•
24 slides
Bigdata 大資料分析實務 (進階上機課程) by
Bigdata 大資料分析實務 (進階上機課程)
家雋 莊
509 views
•
44 slides
文件系统简述.pptx by
文件系统简述.pptx
GuoliangDing3
4 views
•
39 slides
Osc scott linux下的数据库优化for_postgresql by
Osc scott linux下的数据库优化for_postgresql
OpenSourceCamp
1.3K views
•
31 slides
Hadoop+spark實作 by
Hadoop+spark實作
FEG
952 views
•
45 slides
ch13-pv1-system-calls by
ch13-pv1-system-calls
yushiang fu
185 views
•
42 slides
Similar to Linux下Poll和Epoll内核源码剖析
(7)
Linux Tracing System 浅析 & eBPF框架开发经验分享 by happyagan
Linux Tracing System 浅析 & eBPF框架开发经验分享
happyagan
•
8 views
Bigdata 大資料分析實務 (進階上機課程) by 家雋 莊
Bigdata 大資料分析實務 (進階上機課程)
家雋 莊
•
509 views
文件系统简述.pptx by GuoliangDing3
文件系统简述.pptx
GuoliangDing3
•
4 views
Osc scott linux下的数据库优化for_postgresql by OpenSourceCamp
Osc scott linux下的数据库优化for_postgresql
OpenSourceCamp
•
1.3K views
Hadoop+spark實作 by FEG
Hadoop+spark實作
FEG
•
952 views
ch13-pv1-system-calls by yushiang fu
ch13-pv1-system-calls
yushiang fu
•
185 views
Device Driver - Chapter 3字元驅動程式 by ZongYing Lyu
Device Driver - Chapter 3字元驅動程式
ZongYing Lyu
•
1.4K views
More from Hao(Robin) Dong
Transformer and BERT by
Transformer and BERT
Hao(Robin) Dong
749 views
•
19 slides
Google TPU by
Google TPU
Hao(Robin) Dong
2.9K views
•
21 slides
flashcache原理及改造 by
flashcache原理及改造
Hao(Robin) Dong
514 views
•
32 slides
ext2-110628041727-phpapp02 by
ext2-110628041727-phpapp02
Hao(Robin) Dong
386 views
•
33 slides
Ext4 Bigalloc report public by
Ext4 Bigalloc report public
Hao(Robin) Dong
1.4K views
•
5 slides
Overlayfs and VFS by
Overlayfs and VFS
Hao(Robin) Dong
3.3K views
•
40 slides
More from Hao(Robin) Dong
(9)
Transformer and BERT by Hao(Robin) Dong
Transformer and BERT
Hao(Robin) Dong
•
749 views
Google TPU by Hao(Robin) Dong
Google TPU
Hao(Robin) Dong
•
2.9K views
flashcache原理及改造 by Hao(Robin) Dong
flashcache原理及改造
Hao(Robin) Dong
•
514 views
ext2-110628041727-phpapp02 by Hao(Robin) Dong
ext2-110628041727-phpapp02
Hao(Robin) Dong
•
386 views
Ext4 Bigalloc report public by Hao(Robin) Dong
Ext4 Bigalloc report public
Hao(Robin) Dong
•
1.4K views
Overlayfs and VFS by Hao(Robin) Dong
Overlayfs and VFS
Hao(Robin) Dong
•
3.3K views
Ext4 new feature - bigalloc by Hao(Robin) Dong
Ext4 new feature - bigalloc
Hao(Robin) Dong
•
1.4K views
why we need ext4 by Hao(Robin) Dong
why we need ext4
Hao(Robin) Dong
•
3.1K views
Kernel在多核机器上的负载均衡机制 by Hao(Robin) Dong
Kernel在多核机器上的负载均衡机制
Hao(Robin) Dong
•
709 views
Recently uploaded
[GDG Kaohsiung DevFest 2023] 以 Compose 及 Kotlin Multiplatform 打造多平台應用程式 by
[GDG Kaohsiung DevFest 2023] 以 Compose 及 Kotlin Multiplatform 打造多平台應用程式
Shengyou Fan
151 views
•
54 slides
AI Technology & Development of Civilization by
AI Technology & Development of Civilization
unclebrown017
44 views
•
74 slides
商品辨識定位系統_艾鍗學院-AIoT智能行動服務物聯網班 by
商品辨識定位系統_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
40 views
•
37 slides
居家雲端照護系統_艾鍗學院-AIoT智能行動服務物聯網班 by
居家雲端照護系統_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
43 views
•
32 slides
Hacking Facebook for fun and profit by Pranav Hivarekar by
Hacking Facebook for fun and profit by Pranav Hivarekar
Pranav Hivarekar
6 views
•
69 slides
AIoT 智能商店_艾鍗學院-AIoT智能行動服務物聯網班 by
AIoT 智能商店_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
43 views
•
25 slides
Recently uploaded
(6)
[GDG Kaohsiung DevFest 2023] 以 Compose 及 Kotlin Multiplatform 打造多平台應用程式 by Shengyou Fan
[GDG Kaohsiung DevFest 2023] 以 Compose 及 Kotlin Multiplatform 打造多平台應用程式
Shengyou Fan
•
151 views
AI Technology & Development of Civilization by unclebrown017
AI Technology & Development of Civilization
unclebrown017
•
44 views
商品辨識定位系統_艾鍗學院-AIoT智能行動服務物聯網班 by IttrainingIttraining
商品辨識定位系統_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
•
40 views
居家雲端照護系統_艾鍗學院-AIoT智能行動服務物聯網班 by IttrainingIttraining
居家雲端照護系統_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
•
43 views
Hacking Facebook for fun and profit by Pranav Hivarekar by Pranav Hivarekar
Hacking Facebook for fun and profit by Pranav Hivarekar
Pranav Hivarekar
•
6 views
AIoT 智能商店_艾鍗學院-AIoT智能行動服務物聯網班 by IttrainingIttraining
AIoT 智能商店_艾鍗學院-AIoT智能行動服務物聯網班
IttrainingIttraining
•
43 views
Linux下Poll和Epoll内核源码剖析
1.
Linux 下 poll
和 epoll 内核源码剖 析 作者:董昊
2.
epoll 简介 • 传统的
poll 在 fd 特别多时性能骤降 • epoll 是比 select 、 poll 更加高效的用于异 步事件处理的系统调用 • 作者 Davide Libenzi ( http://www.xmailserver.org/linux-patches/nio-im ) 已成为 linux 2.6 内核的标配
3.
s/select.c -->sys_poll] 456 asmlinkage
long sys_poll(struct pollfd __user * ufds, unsigned int nfds, long timeout) 457 { 458 struct poll_wqueues table; 459 int fdcount, err; 460 unsigned int i; 461 struct poll_list *head; 462 struct poll_list *walk; 463 464 /* Do a sanity check on nfds ... */ /* 用户给的 nfds 数不可以超过一个 struct file 结构支持的最大 fd 数(默认是 256 ) * 465 if (nfds > current->files->max_fdset && nfds > OPEN_MAX) 466 return -EINVAL; 467 468 if (timeout) { 469 /* Careful about overflow in the intermediate values */ 470 if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) 471 timeout = (unsigned long)(timeout*HZ+999)/1000+1; 472 else /* Negative or overflow */ 473 timeout = MAX_SCHEDULE_TIMEOUT; 474 } 475
4.
[fs/select.c -->sys_poll] 478 head
= NULL; 479 walk = NULL; 480 i = nfds; 481 err = -ENOMEM; 482 while(i!=0) { 483 struct poll_list *pp; 484 pp = kmalloc(sizeof(struct poll_list)+ 485 sizeof(struct pollfd)* 486 (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i), 487 GFP_KERNEL); 488 if(pp==NULL) 489 goto out_fds; 490 pp->next=NULL; 491 pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i); 492 if (head == NULL) 493 head = pp; 494 else 495 walk->next = pp; 496 497 walk = pp; 498 if (copy_from_user(pp->entries, ufds + nfds-i, 499 sizeof(struct pollfd)*pp->len)) { 500 err = -EFAULT; 501 goto out_fds; 502 } 503 i -= pp->len; 504 } 505 fdcount = do_poll(nfds, head, &table, timeout);
5.
• 当用户传入的 fd
很多时,由于 poll 系统调 用每次都要把所有 struct pollfd 拷进内核, 所以参数传递和页分配此时就成了 poll 系 统调用的性能瓶颈。
6.
427 static int do_poll(unsigned int nfds, struct poll_list *list, 428 struct poll_wqueues *wait, long timeout) 429 { 430 int count = 0; 431 poll_table* pt = &wait->pt; 432 433 if (!timeout) 434 pt = NULL; 435 436 for (;;) { 437 struct poll_list *walk; 438 set_current_state(TASK_INTERRUPTIBLE); 439 walk = list; 440 while(walk != NULL) { 441 do_pollfd( walk->len, walk->entries, &pt, &count); 442 walk = walk->next; 443 } 444 pt = NULL; 445 if (count || !timeout || signal_pending(current)) 446 break; 447 count = wait->error; 448 if (count) 449 break; 450 timeout = schedule_timeout(timeout); /* 让 current
挂起,别的进程跑, timeout 到了以后再回来运行 current*/ 451 } 452 __set_current_state(TASK_RUNNING); 453 return count; 454 }
7.
[fs/select.c-->sys_poll()-->do_poll()] 395 static void do_pollfd(unsigned int num, struct pollfd * fdpage, 396 poll_table ** pwait, int *count) 397 { 398 int i; 399 400 for (i = 0; i < num; i++) { 401 int fd; 402 unsigned int mask; 403 struct pollfd *fdp; 404 405 mask = 0; 406 fdp = fdpage+i; 407 fd = fdp->fd; 408 if (fd >= 0) { 409 struct file * file = fget(fd); 410 mask = POLLNVAL; 411 if (file != NULL) { 412 mask = DEFAULT_POLLMASK; 413 if (file->f_op && file->f_op->poll) 414 mask = file->f_op->poll(file, *pwait); 415 mask &= fdp->events | POLLERR | POLLHUP; 416 fput(file); 417 } 418 if (mask) { 419 *pwait = NULL; 420 (*count)++; 421 } 422 } 423 fdp->revents = mask; 424 }
8.
• 注意标红的 440-443
行,当用户传入的 fd 很多时(比如 1000 个),对 do_pollfd 就 会调用很多次, poll 效率瓶颈的另一原因 就在这里。
9.
• 如果 fd
对应的是某个 socket , do_pollfd 调用的 就是网络设备驱动实现的 poll ;如果 fd 对应的是 某个 ext3 文件系统上的一个打开文件, 那 do_pollfd 调用的就是 ext3 文件系统驱动实现的 poll 。一句话,这个 file->f_op->poll 是设备驱动 程序实现的 • 其实,设备驱动程序的标准实现是:调用 poll_wait ,即以设备自己的等待队列为参数(通 常设备都 有自己的等待队列)调用 struct poll_table 的回调函数。
10.
epoll 原理 • 首先,每次
poll 都要把 1000 个 fd 拷入内核,太 不科学了,内核干嘛不自己保存已经拷入的 fd 呢 ?答对了, epoll 就是自己保存拷入的 fd ,它的 API 就已经说明了这一点——不是 epoll_wait 的 时候才传入 fd ,而是通过 epoll_ctl 把所有 fd 传 入内核再一起“ wait” ,这就省掉了不必要的重复 拷贝。 • 其次,在 epoll_wait 时,也不是把 current 轮流 的加入 fd 对应的设备等待队列,而是在设备等待 队列醒来时调用一个回调函数(当然,这就需要 “唤醒回 调”机制),把产生事件的 fd 归入一个链 表,然后返回这个链表上的 fd 。
11.
[fs/eventpoll.c-->evetpoll_init()] 1582 static int __init eventpoll_init(void) 1583 { 1584 int error; 1585 1586 init_MUTEX(&epsem); 1587 1588 /* Initialize the structure used to perform safe poll wait head wake ups */ 1589 ep_poll_safewake_init(&psw); ….. 1601 /* 1602 * Register the virtual file system that will be the source of inodes 1603 * for the eventpoll files 1604 */ 1605 error = register_filesystem(&eventpoll_fs_type); 1606 if (error) 1607 goto epanic; 1608 1609 /* Mount the above commented virtual file system */ 1610 eventpoll_mnt = kern_mount(&eventpoll_fs_type); 1611 error = PTR_ERR(eventpoll_mnt); 1612 if (IS_ERR(eventpoll_mnt)) 1613 goto epanic; 1614 1615 DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.n", 1616 current)); 1617 return 0; 1618 1619 epanic: 1620 panic("eventpoll_init() failedn"); 1621 }
12.
• 这个 module
在初始化时注册了一个新的文件系 统,叫“ eventpollfs” (在 eventpoll_fs_type 结构 里),然后挂载此文 件系统。另外创建两个内核 cache (在内核编程中,如果需要频繁分配小块 内存,应该创建 kmem_cahe 来做“内存池”) , 分 别用于存放 struct epitem 和 eppoll_entry 。 • 想想 epoll_create 为什么会返回一个新的 fd ?因 为它就是在这个叫做 "eventpollfs" 的文件系统里 创建了一个新文件
13.
• 为什么是个新文件系统? • linux
的系统调用有多少是返回指针的?你会发现 几乎没有!(特此强调, malloc 不是系统调用, malloc 调用的 brk 才是)因 为 linux 做为 unix 的 最杰出的继承人,它遵循了 unix 的一个巨大有点 —一切皆文件 • 使用文件系统有个好处: epoll_create 返回的是 一个 fd ,而不是该死的指针,指针如果指错了, 你简直没办法判断,而 fd 则可以通 过 current- >files->fd_array[] 找到其真伪。
14.
[fs/eventpoll.c-->sys_epoll_ctl()] 524 asmlinkage long 525 sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event) 526 { .... 575 epi = ep_find(ep, tfile, fd); 576 577 error = -EINVAL; 578 switch (op) { 579 case EPOLL_CTL_ADD: 580 if (!epi) { 581 epds.events |= POLLERR | POLLHUP; 582 583 error = ep_insert(ep, &epds, tfile, fd); 584 } else 585 error = -EEXIST; 586 break; 587 case EPOLL_CTL_DEL: 588 if (epi) 589 error = ep_remove(ep, epi); 590 else 591 error = -ENOENT; 592 break; 593 case EPOLL_CTL_MOD: 594 if (epi) { 595 epds.events |= POLLERR | POLLHUP; 596 error = ep_modify(ep, epi, &epds); 597 } else 598 error = -ENOENT; 599 break; 600 }
15.
• 看 ep_find
的调用方式, ep 参数应该是指向这个 “大结构”的指针,再看 ep = file->private_data • 具体看看 ep_find 的实现,发现原来是 struct eventpoll 的 rbr 成员( struct rb_root ),是一个 红黑树的根!而红黑树上挂的都是 struct epitem 。 • 现在清楚了,一个新创建的 epoll 文件带有一个 struct eventpoll 结构,这个结构上再挂一个红黑 树,而这个红黑树就是每次 epoll_ctl 时 fd 存放的 地方!
16.
[fs/eventpoll.c-->ep_ptable_queue_proc()] 883 static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, 884 poll_table *pt) 885 { 886 struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt); 887 struct eppoll_entry *pwq; 888 889 if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) { 890 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); 891 pwq->whead = whead; 892 pwq->base = epi; 893 add_wait_queue(whead, &pwq->wait); 894 list_add_tail(&pwq->llink, &epi->pwqlist); 895 epi->nwait++; 896 } else { 897 /* We have to signal that an error occurred */ 898 epi->nwait = -1; 899 } 900 }
17.
• 上面的代码就是 ep_insert
中要做的最重要的事: 创建 struct eppoll_entry ,设置其唤醒回调函数为 ep_poll_callback ,然后加入设备等待队列(注意 这里的 whead 就是上一章所说的每个设 备驱动 都要带的等待队列) • 只有这样,当设备就绪,唤醒等待队列上的等待 着时, ep_poll_callback 就会被调用。 poll 是自 己挨个去问每个 fd“ 好没有?”、“好没 有?“( poll ),而 epoll 是给了每个 fd 一个命令 “好了就调我的函数!”,然后就去睡了,等着设备 驱动通知它 ....epoll 的方法显然更高效
18.
为什么研究内核? • 要获得最高的服务器使用效能,就必须优 化所有软件,而操作系统它不过也就是一 个软件。 • 知其然,亦知其所以然
19.
• 本 ppt
全文 – http://donghao.org/2009/08/linuxiapolliepollau eouaeaeeio.html