Epoll - from the kernel side

Published in: Technology
  • Picture borrowed from Robert Love, Linux Kernel Development

    1. 1. Epoll - From the kernel side <ul><li>“There are no secret messages in the source code.” </li></ul><ul><li>Lijin Liu <llj098@gmail.com> </li></ul><ul><li>twitter: @llj098 & http://blog.fatlj.me </li></ul>
    2. 2. Some basics about I/O <ul><li>Q: What is I/O? </li></ul><ul><li>A: I/O is what connects the CPU to the outside world. </li></ul>
    3. 3. Some basics about I/O <ul><li>Three kinds of I/O: </li></ul><ul><li>memory-mapped input/output </li></ul><ul><li>I/O-mapped (port-mapped) input/output </li></ul><ul><li>direct memory access (DMA) </li></ul>
    4. 4. Some basics about I/O <ul><li>PCI, ISA, EISA, NuBus... </li></ul><ul><ul><li>PCI controller </li></ul></ul><ul><li>Interrupt controller </li></ul><ul><ul><li>Also an I/O device </li></ul></ul><ul><ul><li>some devices can communicate with it directly, without involving the CPU </li></ul></ul><ul><li>Polled I/O </li></ul><ul><ul><li>delayed handling </li></ul></ul>
    5. 5. I/O models – back to software world <ul><li>- Blocking IO </li></ul><ul><ul><li>normal read/write/open... system calls </li></ul></ul><ul><li>- Non-blocking IO </li></ul><ul><ul><li>fcntl/ioctl </li></ul></ul><ul><li>- IO multiplexing </li></ul><ul><ul><li>SELECT </li></ul></ul><ul><li>- Event driven </li></ul><ul><ul><li>EPOLL/KQUEUE </li></ul></ul><ul><li>- AIO </li></ul><ul><ul><li>IOCP </li></ul></ul>
    6. 6. Blocking / non-blocking I/O <ul><li>- user space API / system calls </li></ul><ul><ul><li>read, write, accept, open, close... </li></ul></ul><ul><li>- Blocking IO </li></ul><ul><ul><li>one thread/process per connection </li></ul></ul><ul><li>- Non-blocking IO </li></ul><ul><ul><li>ioctl/fcntl </li></ul></ul><ul><ul><li>check in a loop </li></ul></ul>
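The non-blocking case can be sketched in user space as follows. This is a minimal illustration, assuming a Linux/POSIX system; `set_nonblocking` and `read_empty_pipe_errno` are hypothetical helper names, not something taken from the slides.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch an fd to non-blocking mode with fcntl(). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Reads from an empty non-blocking pipe and returns the errno that
 * read() reported, or 0 if the read unexpectedly succeeded. */
static int read_empty_pipe_errno(void)
{
    int fds[2];
    char buf[16];
    int err = 0;

    int rc = pipe(fds);
    assert(rc == 0);
    rc = set_nonblocking(fds[0]);
    assert(rc == 0);

    /* Nothing was written, so a blocking read would sleep here.
     * In non-blocking mode the call returns immediately with -1. */
    ssize_t n = read(fds[0], buf, sizeof buf);
    if (n == -1)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling `read_empty_pipe_errno()` should yield `EAGAIN`; a caller that wants the data then has to retry, which is the loop checking the slide mentions.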
    7. 7. IO-Multiplex <ul><li>select/poll </li></ul><ul><li>Shortcomings </li></ul><ul><ul><li>the number of fds is limited </li></ul></ul><ul><ul><li>still another form of loop checking </li></ul></ul>
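A small user-space sketch of the multiplexing style, assuming a Linux/POSIX system and using `poll()` on two pipes. The function name `count_readable_with_poll` is made up for illustration. It shows that the call returns only a count, so the caller still walks the whole array to find which fds are ready.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Watches two pipes, writes to only the second one, and returns how
 * many entries poll() flagged as readable. */
static int count_readable_with_poll(void)
{
    int a[2], b[2];
    int rc = pipe(a);
    assert(rc == 0);
    rc = pipe(b);
    assert(rc == 0);

    ssize_t w = write(b[1], "x", 1);
    assert(w == 1);

    struct pollfd fds[2] = {
        { .fd = a[0], .events = POLLIN },
        { .fd = b[0], .events = POLLIN },
    };

    /* Timeout 0: report the current state without sleeping. */
    int ready = poll(fds, 2, 0);

    /* poll() returns a count only; every entry must be inspected. */
    int seen = 0;
    for (int i = 0; i < 2; i++)
        if (fds[i].revents & POLLIN)
            seen++;
    assert(ready == seen);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return seen;
}
```

With thousands of mostly idle connections, that per-call scan over every entry is the cost the later slides set out to avoid.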
    8. 8. SELECT/POLL Internals - basics <ul><li>process: task_struct </li></ul><ul><li>No separate thread type in the Linux kernel, just processes, or ‘tasks’ </li></ul><ul><li>data structure </li></ul><ul><ul><li>include/linux/list.h </li></ul></ul><ul><li>- process scheduler </li></ul><ul><ul><li>CFS </li></ul></ul><ul><li>Process state machine </li></ul>
    9. 9. SELECT/POLL Internals - basics
    10. 10. SELECT/POLL Internals - basics <ul><li>sleep/wake-up mechanism in the Linux kernel </li></ul><ul><ul><li>wait_queue </li></ul></ul><ul><li>structures: </li></ul><ul><li>struct __wait_queue { </li></ul><ul><li>unsigned int flags; </li></ul><ul><li>void *private; </li></ul><ul><li>wait_queue_func_t func; /* callback function */ </li></ul><ul><li>struct list_head task_list; </li></ul><ul><li>}; </li></ul>
    11. 11. SELECT/POLL Internals – basics <ul><li>How to wait: </li></ul><ul><li>linux/kernel/sched/core.c : schedule() -> __schedule(): </li></ul><ul><ul><ul><li>... </li></ul></ul></ul><ul><ul><ul><li>next = pick_next_task(rq); </li></ul></ul></ul><ul><ul><ul><li>... </li></ul></ul></ul><ul><ul><ul><li>context_switch(rq, prev, next); /* unlocks the rq */ </li></ul></ul></ul><ul><ul><ul><li>switch_mm(oldmm, mm, next); /* called from context_switch(); arch dependent, x86: arch/x86/include/asm/mmu_context.h */ </li></ul></ul></ul><ul><ul><ul><li>..... </li></ul></ul></ul>
    12. 12. SELECT/POLL Internals - some basics <ul><li>Interrupts </li></ul><ul><ul><li>interrupt controller </li></ul></ul><ul><ul><ul><li>Device </li></ul></ul></ul><ul><ul><ul><li>programmable </li></ul></ul></ul><ul><ul><li>interrupt handler </li></ul></ul><ul><ul><ul><li>Often, the device driver registers itself as the interrupt handler </li></ul></ul></ul><ul><ul><li>Softirq </li></ul></ul><ul><ul><ul><li>bottom halves </li></ul></ul></ul><ul><ul><ul><li>ksoftirqd </li></ul></ul></ul>
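    13. 13. SELECT/POLL Internals <ul><li>Why are poll/select not cool? </li></ul><ul><li>Every call walks all watched fds and invokes each file's poll hook: </li></ul><ul><li>fs/select.c do_select(): </li></ul><ul><li>for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) { </li></ul><ul><li>... </li></ul><ul><li>file = fget_light(i, &fput_needed); </li></ul><ul><li>if (file) { </li></ul><ul><li>f_op = file->f_op; </li></ul><ul><li>mask = DEFAULT_POLLMASK; </li></ul><ul><li>if (f_op && f_op->poll) </li></ul><ul><li>mask = (*f_op->poll)(file, retval ? NULL : wait); </li></ul><ul><li>... </li></ul><ul><li>} </li></ul><ul><li>} </li></ul>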
    14. 14. SELECT/POLL Internals -- tcp_poll <ul><li>- in the VFS layer, the xxx_poll handlers do not block </li></ul><ul><li>- net/ipv4/tcp.c : unsigned int tcp_poll() /* connect/close state checks omitted */ </li></ul><ul><li>   ..... </li></ul><ul><li>if (tp->urg_seq == tp->copied_seq && </li></ul><ul><li>!sock_flag(sk, SOCK_URGINLINE) && </li></ul><ul><li>tp->urg_data) </li></ul><ul><li>target++; </li></ul><ul><li>  if (tp->rcv_nxt - tp->copied_seq >= target) </li></ul><ul><li>mask |= POLLIN | POLLRDNORM; </li></ul><ul><li>      .... </li></ul>
    15. 15. Here Comes the EPOLL <ul><li>User space API </li></ul><ul><ul><li>epoll_create() , epoll_ctl() , epoll_wait() </li></ul></ul><ul><li>structures </li></ul><ul><li>struct epoll_event { </li></ul><ul><li>uint32_t events; </li></ul><ul><li>epoll_data_t data; </li></ul><ul><li>};     </li></ul><ul><li>typedef union epoll_data { </li></ul><ul><li>void *ptr; </li></ul><ul><li>int fd; </li></ul><ul><li>uint32_t u32; </li></ul><ul><li>uint64_t u64; </li></ul><ul><li>} epoll_data_t; </li></ul><ul><li>- LT/ET (level-triggered / edge-triggered) modes </li></ul>
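The difference between LT and ET mode can be observed from user space with a short experiment. This is a sketch assuming Linux; the function name `second_wait_result` is hypothetical. It registers a pipe, makes it readable once, and then calls `epoll_wait()` twice without reading the data in between.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* extra_flags: 0 for level-triggered, EPOLLET for edge-triggered.
 * Returns the result of the second epoll_wait() call. */
static int second_wait_result(uint32_t extra_flags)
{
    int p[2];
    struct epoll_event ev = {0}, out;

    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd != -1);

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    /* Make the read end readable exactly once. */
    ssize_t w = write(p[1], "x", 1);
    assert(w == 1);

    /* Timeout 0 so neither call sleeps. The byte is never read. */
    int first = epoll_wait(efd, &out, 1, 0);
    assert(first == 1);
    int second = epoll_wait(efd, &out, 1, 0);

    close(efd);
    close(p[0]);
    close(p[1]);
    return second;
}
```

In LT mode the second call reports the fd again because data is still pending. In ET mode it reports nothing until new data arrives, which is why ET users normally read until `EAGAIN`.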
    16. 16. Epoll code demo  <ul><li>#define MAX_EVENTS 10 </li></ul><ul><li>struct epoll_event ev, events[MAX_EVENTS]; </li></ul><ul><li>int efd = epoll_create(1024); </li></ul><ul><li>... </li></ul><ul><li>epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev); </li></ul><ul><li>... </li></ul><ul><li>while (1) { </li></ul><ul><li>int n = epoll_wait(efd, events, MAX_EVENTS, -1); </li></ul><ul><li>for (int i = 0; i < n; i++) { /* index with i, not n */ </li></ul><ul><li>if (events[i].data.fd == listenfd) { </li></ul><ul><li>conn = accept(listenfd, (struct sockaddr *)addr, &addrlen); </li></ul><ul><li>setnonblocking(conn); </li></ul><ul><li>ev.events = EPOLLIN | EPOLLET; </li></ul><ul><li>ev.data.fd = conn; </li></ul><ul><li>epoll_ctl(efd, EPOLL_CTL_ADD, conn, &ev); </li></ul><ul><li>} </li></ul><ul><li>else { </li></ul><ul><li>do_work(events[i].data.fd); </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>} </li></ul>
    17. 17. EPOLL Internals <ul><li>some structures </li></ul><ul><li>kernel side: </li></ul><ul><ul><li>eventpoll: the main structure, created by epoll_create() </li></ul></ul><ul><ul><li>epitem: wrapper around a watched file; this is the struct stored in the RB tree </li></ul></ul><ul><ul><li>eppoll_entry: wait structure for the poll hooks </li></ul></ul><ul><ul><li>epoll_event: same as in user space </li></ul></ul><ul><li>user space: </li></ul><ul><ul><li>epoll_event: same as the kernel-side epoll_event above </li></ul></ul><ul><ul><li>epoll_data: custom data area </li></ul></ul>
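The "custom data area" is easiest to see with `data.ptr`: whatever user space stores there at `epoll_ctl()` time comes back unchanged from `epoll_wait()`. A minimal sketch, assuming Linux; `struct conn` and `ptr_round_trips` are invented for this example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection state kept by the application. */
struct conn {
    int fd;
    const char *name;
};

/* Returns 1 if the pointer stored in epoll_data comes back intact. */
static int ptr_round_trips(void)
{
    int p[2];
    struct conn c;
    struct epoll_event ev = {0}, out;

    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd != -1);

    c.fd = p[0];
    c.name = "demo-conn";

    /* epoll_data is a union: use either ptr or fd, not both. */
    ev.events = EPOLLIN;
    ev.data.ptr = &c;
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t w = write(p[1], "x", 1);
    assert(w == 1);

    int n = epoll_wait(efd, &out, 1, 0);
    struct conn *got = (n == 1) ? (struct conn *)out.data.ptr : NULL;
    int ok = (got == &c && got->fd == p[0]);

    close(efd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Storing a pointer lets the event loop reach its own connection state directly, instead of looking it up by fd after each wakeup.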
    18. 18. EPOLL Internals <ul><li>Why does it work? </li></ul><ul><ul><li>the kernel is event based; user space may not be </li></ul></ul><ul><li>How does it work? </li></ul><ul><ul><li>add the fd to the epoll with epoll_ctl() </li></ul></ul><ul><ul><li>sleep in epoll_wait() to fish for active fds </li></ul></ul><ul><ul><li>an interrupt happens </li></ul></ul><ul><ul><li>the sleeping process is woken up </li></ul></ul><ul><ul><li>the active fds are sent to user space </li></ul></ul>
    19. 19. EPOLL Internals <ul><li>add fd to epoll </li></ul><ul><li>fs/eventpoll.c: </li></ul><ul><ul><li>epoll_ctl() -> ep_insert() -> ep_rbtree_insert() </li></ul></ul><ul><li>when we add an fd to an eventpoll, first initialize the corresponding structure: epitem </li></ul><ul><li>set up some callback functions for this file </li></ul><ul><li>add the epitem to the rbtree </li></ul>
    20. 20. EPOLL Internals - how to sleep <ul><li>two wait_queues </li></ul><ul><ul><li>one for the current process </li></ul></ul><ul><ul><li>one for ksoftirqd </li></ul></ul><ul><li>epoll_wait() system call </li></ul><ul><ul><li>set the current process to TASK_INTERRUPTIBLE </li></ul></ul><ul><ul><li>schedule() </li></ul></ul>
    21. 21. EPOLL Internals - how to wakeup <ul><li>Work flow </li></ul><ul><ul><li>Interrupt handler </li></ul></ul><ul><ul><ul><li>fd becomes active </li></ul></ul></ul><ul><ul><ul><li>wait_queue #1 is activated on ksoftirqd </li></ul></ul></ul><ul><ul><ul><li>ep_poll_callback() fires and activates wait_queue #2 </li></ul></ul></ul><ul><ul><ul><li>the user process is set to running </li></ul></ul></ul><ul><ul><ul><li>the user process is scheduled: wake up! </li></ul></ul></ul><ul><ul><ul><li>the ready fds are copied to user space </li></ul></ul></ul>
    22. 22. EPOLL Internals - how to wakeup <ul><li>TCP demo </li></ul><ul><li>-  cd net/ipv4/ </li></ul><ul><li>- af_inet.c: struct net_protocol tcp_protocol </li></ul><ul><li>- tcp_ipv4.c:tcp_v4_rcv() </li></ul><ul><li>- tcp_ipv4.c:tcp_v4_do_rcv() </li></ul><ul><li>- tcp_input.c:tcp_rcv_established() </li></ul><ul><li>-  cd ../core </li></ul><ul><li>- sock.c:sock->sk_data_ready() </li></ul><ul><li>- sock.c:sock->sock_def_readable() </li></ul><ul><li>- ep_poll_callback() : </li></ul><ul><li>    - add the fd to the epoll's ready list </li></ul><ul><li>    - wake the process blocked above (in epoll_wait) </li></ul><ul><li>- after the blocked process wakes: </li></ul><ul><li>    - ep_send_events() </li></ul><ul><li>        - ep_scan_ready_list() : copy the epoll's ready list to a tmp list (ref copy) </li></ul><ul><li>            - ep_send_events_proc() : transfer to user space </li></ul><ul><li>        - move the ovflist entries to the epoll's ready list </li></ul>
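The callback-then-drain flow above can be modelled in plain user-space C. This is a toy sketch only: every `toy_` name is invented, there is no locking, no ovflist, and no real wait queue, and the actual structures in fs/eventpoll.c are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an epitem: one watched fd plus ready-list linkage. */
struct toy_epitem {
    int fd;
    int on_ready_list;
    struct toy_epitem *next_ready;
};

/* Toy stand-in for an eventpoll: a ready list and a "woken" flag that
 * models waking the task sleeping in epoll_wait(). */
struct toy_eventpoll {
    struct toy_epitem *ready_head;
    struct toy_epitem *ready_tail;
    int waiter_woken;
};

/* Models the callback step: queue the item once, then wake the waiter. */
static void toy_poll_callback(struct toy_eventpoll *ep, struct toy_epitem *epi)
{
    if (!epi->on_ready_list) {
        epi->on_ready_list = 1;
        epi->next_ready = NULL;
        if (ep->ready_tail)
            ep->ready_tail->next_ready = epi;
        else
            ep->ready_head = epi;
        ep->ready_tail = epi;
    }
    ep->waiter_woken = 1;
}

/* Models the woken process draining the ready list into its buffer. */
static int toy_send_events(struct toy_eventpoll *ep, int *out_fds, int max)
{
    int n = 0;
    while (ep->ready_head && n < max) {
        struct toy_epitem *epi = ep->ready_head;
        ep->ready_head = epi->next_ready;
        if (!ep->ready_head)
            ep->ready_tail = NULL;
        epi->on_ready_list = 0;
        epi->next_ready = NULL;
        out_fds[n++] = epi->fd;
    }
    ep->waiter_woken = 0;
    return n;
}

/* Fires the callback for fd 7, fd 5, then fd 7 again, and drains. */
static int toy_demo(int *out_fds, int max)
{
    struct toy_eventpoll ep = {0};
    struct toy_epitem a = { .fd = 5 }, b = { .fd = 7 };

    toy_poll_callback(&ep, &b);
    toy_poll_callback(&ep, &a);
    toy_poll_callback(&ep, &b);   /* already queued: no duplicate entry */
    assert(ep.waiter_woken);
    return toy_send_events(&ep, out_fds, max);
}
```

The point of the model is that the work done on wakeup is proportional to the number of ready items, not to the number of watched fds.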
    23. 23. EPOLL - the whole picture <ul><li>- two wait_queues, one for ksoftirqd and one for the user process; one triggers the other </li></ul><ul><li>- three locks (two mutexes, one spinlock) </li></ul><ul><li>- a red-black tree of epitems </li></ul>
    24. 24. Compared to IOCP <ul><li>IOCP is AIO; EPOLL/KQUEUE are event-based multiplexing </li></ul><ul><li>IOCP has to take care of the I/O operation itself </li></ul><ul><li>EPOLL is just a notification mechanism: light and flexible </li></ul><ul><li>IOCP carries the overhead of a thread pool </li></ul>
    25. 25. References <ul><li>Linus and kernel hackers - Linux kernel source tree </li></ul><ul><ul><li>http://kernel.org </li></ul></ul><ul><li>Robert Love - Linux Kernel Development </li></ul><ul><ul><li>http://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468/ </li></ul></ul><ul><li>Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman - Linux Device Drivers </li></ul><ul><ul><li>http://www.amazon.com/Linux-Device-Drivers-Jonathan-Corbet/dp/0596005903/ </li></ul></ul><ul><li>Christian Benvenuti - Understanding Linux Network Internals </li></ul><ul><ul><li>http://www.amazon.com/Understanding-Network-Internals-Christian-Benvenuti/dp/0596002556/ </li></ul></ul><ul><li>Randall Hyde - Write Great Code: Volume 1: Understanding the Machine </li></ul><ul><ul><li>http://www.amazon.com/Write-Great-Code-Understanding-Machine/dp/1593270038 </li></ul></ul><ul><li>W. Richard Stevens, Bill Fenner, Andrew M. Rudoff - Unix Network Programming, Volume 1 </li></ul><ul><ul><li>http://www.amazon.com/Unix-Network-Programming-Sockets-Networking/dp/0131411551 </li></ul></ul><ul><li>W. Richard Stevens, Stephen A. Rago - Advanced Programming in the UNIX Environment </li></ul><ul><ul><li>http://www.amazon.com/Programming-Environment-Addison-Wesley-Professional-Computing/dp/0321525949/ </li></ul></ul><ul><li>David A Rusling - The Linux Kernel </li></ul><ul><ul><li>http://tldp.org/LDP/tlk/dd/interrupts.html </li></ul></ul>
    26. 26. References <ul><li>The design and implementation of epoll in the Linux kernel </li></ul><ul><ul><li>http://www.pagefault.info/?p=264 </li></ul></ul><ul><li>IOCP, kqueue, epoll ... how important are they? </li></ul><ul><ul><li>http://blog.codingnow.com/2006/04/iocp_kqueue_epoll.html </li></ul></ul><ul><li>The Linux kernel's interrupt controller API </li></ul><ul><ul><li>http://www.stillhq.com/pdfdb/000447/data.pdf </li></ul></ul><ul><li>Wikipedia: port-mapped I/O </li></ul><ul><ul><li>http://en.wikipedia.org/wiki/Port-mapped_I/O </li></ul></ul><ul><li>Wikipedia: DMA </li></ul><ul><ul><li>http://en.wikipedia.org/wiki/Direct_memory_access </li></ul></ul><ul><li>Improving (network) I/O performance </li></ul><ul><ul><li>http://www.xmailserver.org/linux-patches/nio-improve.html </li></ul></ul>
