CLSF 2010




Agenda
• Photos
• Attendance Overview
• Key Topics
• Industry Talks




LSF 2010, Shanghai




LSF 2010 participants / companies

  Company     # of participants   Key background

  Intel       5                   Kernel performance, SSD, mem mgmt
  EMC         5                   Storage, file system
  Fujitsu     4                   IO controller, btrfs
  Taobao      3                   Distributed storage, Taobao server
  Novell      2                   SUSE server, HA
  Oracle      2                   OCFS2 dev/test
  Baidu       2                   Baidu kernel optimization
  Canonical   2
  Red Hat     1                   Network driver




Key topics

 Topic                     Slides?   Description
 Page writeback            Y         Dirty page ratio limit; control of which
                                     process writes pages
 CFQ, IO controller        Y         CFQ introduction and upcoming features
 Btrfs                     N         Memory consumption, fsck speed
 SSD/Block layer           Y         Block layer issues with SSDs
 Kernel tracing            N         ftrace
 VFS scalability           N         Multi-core challenges
 Kernel testing            Y         Intel kernel auto test framework
 Industry talk: Taobao     Y         TFS, Tair
 Industry talk: Baidu      N         The architecture of the Baidu search system
 Industry talk: EMC        N         fsck


Writeback - Wu Fengguang
• vmscan is a bottleneck
         Decrease the dirty ratio under memory pressure, so vmscan is less likely to
find dirty pages during page allocation.
• pageout(page) calls writepage() to write to disk, which is a performance killer
since it issues random writes
         Let the flusher write instead; expand a single 4K write to a 4MB write, so
more dirty pages are reclaimed and flushed.
• balance_dirty_pages() should not write: random writes kill performance
         Let the flusher write and make the throttled process sleep. Three proposals
(a sketch of proposal b follows this slide):
         a) wait for I/O completion: NFS completion is bumpy, so a smoother sleep
method is needed
         b) sleep(dirtied * 3/2 / write_bandwidth)
         c) sleep(dirtied / throttle_bandwidth)
• Flusher default write size (4MB -> 128MB), to become dynamic in the future
         Baidu's practice: SSD random-write performance is really bad. For sequential
writes, increasing the writeback size (4MB -> 40MB) raised SSD throughput to roughly
120% of baseline.
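A minimal user-space sketch of throttling proposal (b), assuming a count of pages
dirtied since the last throttle point and a measured flusher bandwidth in pages per
second; all names and values here are illustrative, not the kernel's:

```c
/* Sketch of throttling proposal (b): instead of writing pages itself,
 * the throttled process sleeps long enough for the flusher to catch up.
 * dirtied_pages and write_bandwidth are placeholder values. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long dirtied_pages = 1024;    /* dirtied since last throttle */
    unsigned long write_bandwidth = 25600; /* flusher pages/sec, measured */

    /* proposal (b): sleep (dirtied * 3/2) / write_bandwidth seconds,
     * expressed in microseconds for usleep() */
    unsigned long usecs =
        dirtied_pages * 3 / 2 * 1000000UL / write_bandwidth;

    printf("throttling for %lu us\n", usecs);
    usleep(usecs);
    return 0;
}
```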




Btrfs - Coly Li
• Gets less love from the Linux community than it used to
         two years ago > now
• Used in the MeeGo project
• Taobao plans to push industrial deployment in 2-3 years
         10TB per data server in the TFS cluster
         To be used on hybrid SSD/SATA data servers:
         metadata will be allocated on SSD and data on SATA.
• Dynamic data relocation with the hot data tracking patch
         For generic fs usage, it needs to work with device mapper to get device
speed information.
• fsck
         A difficult must. Currently assigned to Fujitsu.




SSD challenges - Li Shaohua
• Throughput: the same issue as networking
• Disk controller gap and big locks (queue locks & SCSI locks)
• Interrupt related:
    a. smp affinity: with a single queue, dedicate one CPU to handling the irqs
       (see the sketch after this list)
    b. blk_iopoll: poll multiple completed requests per interrupt
• Hardware multiqueue is needed
• CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
• Queue lock contention vs. cache lock contention
    See Andi Kleen's talk at the Tokyo Linux Conf
• Intel is building a next-gen PCIe SSD with many fancy features. Stay tuned
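The smp affinity item maps onto a standard procfs knob; a hedged sketch that pins
one IRQ to CPU 0 (the IRQ number 42 is a placeholder to be read from
/proc/interrupts):

```c
/* Pin a block-device IRQ to one CPU by writing a hex CPU mask to
 * /proc/irq/<n>/smp_affinity. IRQ 42 is a placeholder; find the real
 * number in /proc/interrupts. Must run as root. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "1\n"); /* mask 0x1: CPU 0 handles all of this IRQ */
    fclose(f);
    return 0;
}
```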




VFS scalability - Ma Tao
•   With many cores, all global locks suck
•   Global icache/dcache can be adapted to per-CPU structures (see the sketch below)
•   CFQ can be adapted to per-queue
•   The fewer global locks, the better
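A toy illustration of the per-CPU idea, not the kernel's icache/dcache code: each
thread updates its own cache-line-padded slot instead of contending on one global
lock, and only the reader aggregates across slots.

```c
/* Toy per-CPU pattern: each thread updates its own padded slot, so no
 * global lock is taken on the hot path; a reader sums the slots. */
#include <pthread.h>
#include <stdio.h>

#define NCPUS 4

struct slot {
    unsigned long count;
    char pad[64 - sizeof(unsigned long)]; /* avoid false sharing */
};

static struct slot slots[NCPUS];

static void *worker(void *arg)
{
    long cpu = (long)arg;

    for (int i = 0; i < 1000000; i++)
        slots[cpu].count++; /* private slot: no global lock needed */
    return NULL;
}

int main(void)
{
    pthread_t t[NCPUS];
    unsigned long total = 0;

    for (long i = 0; i < NCPUS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NCPUS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NCPUS; i++)
        total += slots[i].count; /* the only cross-CPU aggregation */
    printf("total = %lu\n", total);
    return 0;
}
```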




Industry talk – Baidu
• Service types
    a) Indexing: random read, high IOPS, small IO size, read-only. 80M records per
       data node, processing 8-9K queries per second.
    b) Distributed system: large files, sequential read/write.
    c) Cache/KV storage: between a and b.
    d) Web server: CPU bound.
•   For a), read() sucks. Use mmap() to read blocks ahead and avoid the
    kernel/userspace memory copy (a sketch follows this slide).
    mmap() does not feed the page cache LRU, so call readahead() after each mmap()
    to mark the pages as read.
    mmap() page faults are expensive under the mm->mmap_sem lock; use
    sync_readahead() and sync_readaheadv().
    With the above, memory becomes the bottleneck, at 10GB+ of reads.
•   Google patch for reducing mm->mmap_sem hold time
    In do_page_fault(), drop mmap_sem if the page is not found, read the page in,
    then take the lock again.
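A hedged sketch of the mmap()-plus-readahead pattern described above, using the
standard Linux readahead(2) syscall; the file path is a placeholder, and Baidu's
sync_readahead()/sync_readaheadv() are their own extensions, not shown here:

```c
/* Sketch of the described pattern: mmap() the index file so lookups
 * avoid read()'s kernel->user copy, then readahead(2) to populate the
 * page cache instead of faulting one page at a time. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/index.block", O_RDONLY); /* placeholder path */
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* populate the page cache behind the mapping in one go */
    if (readahead(fd, 0, st.st_size) < 0)
        perror("readahead");

    /* ... random index lookups go straight to p[] with no copy ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```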




Industry talk – Baidu (Cont.)
• 8K filesystem block size (ext2) with 8K page size: allocate two contiguous
  pages at a time
         Gets better performance for sequential IO
         (For comparison, OCFS2 uses a 1MB fs block size with a 4K page size)
• PCIe compression card + ECC




Industry Talk - Taobao
• TFS
• Tair uses an update server to record updates; updates are applied to the
  production system overnight
• A config server is used to minimize meta server workload (see the sketch below)
   A versioned bucket table is maintained by the config server and stored on each
  data server. Clients locate data using the bucket table returned by the config
  server.
• Both TFS and Tair are open-source projects now
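A toy version of the versioned bucket table as described: the client hashes a key
to a bucket and maps the bucket to a data server id, consulting the config server
only when its cached table version is stale. Structures and values are
illustrative, not Tair's actual code.

```c
/* Toy versioned bucket table (illustrative only): the client resolves
 * key -> bucket -> data server locally from the cached table. */
#include <stdio.h>

#define NBUCKETS 8

struct bucket_table {
    unsigned long version;   /* bumped by config server on rebalancing */
    int server_of[NBUCKETS]; /* bucket -> data server id */
};

static unsigned hash_key(const char *key)
{
    unsigned h = 5381;

    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h;
}

int main(void)
{
    struct bucket_table t = {
        .version = 7,
        .server_of = { 0, 1, 2, 0, 1, 2, 0, 1 },
    };
    const char *key = "item:42";
    int bucket = hash_key(key) % NBUCKETS;

    /* the client then talks to this data server directly */
    printf("key %s -> bucket %d -> server %d (table v%lu)\n",
           key, bucket, t.server_of[bucket], t.version);
    return 0;
}
```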




Industry Talk - EMC
• Introduced the recovery and consistency-check methods used in filesystems from
  ext2 to btrfs
• Emphasized the importance of fsck; presented the issues fsck hits when checking
  a huge filesystem; collected proposals to solve this problem
• pNFS learning notes




Q&A




