CLSF 2010


Published in: Technology

  1. CLSF 2010
  2. Agenda
     • Photos
     • Attendance Overview
     • Key Topics
     • Industry Talks
  3. LSF 2010, Shanghai
  4. LSF 2010 participants/companies

     | Company   | # of participants | Key background                      |
     |-----------|-------------------|-------------------------------------|
     | Intel     | 5                 | Kernel performance, SSD, mem mgmt   |
     | EMC       | 5                 | Storage, file system                |
     | Fujitsu   | 4                 | IO controller, btrfs                |
     | Taobao    | 3                 | Distributed storage, Taobao server  |
     | Novell    | 2                 | SUSE server, HA                     |
     | Oracle    | 2                 | OCFS2 dev/test                      |
     | Baidu     | 2                 | Baidu kernel optimization           |
     | Canonical | 2                 |                                     |
     | Red Hat   | 1                 | Network driver                      |
  5. Key topics

     | Topic                   | Slides? | Description                                            |
     |-------------------------|---------|--------------------------------------------------------|
     | Page writeback          | Y       | Dirty page ratio limit; control process to write pages |
     | CFQ, IO controller      | Y       | CFQ introduction and further features                  |
     | BTRFS                   | N       | Memory consumption, fsck speed                         |
     | SSD/Block layer         | Y       | Block layer issues with SSD                            |
     | Kernel tracing          | N       | ftrace                                                 |
     | VFS scalability         | N       | Multi-core challenges                                  |
     | Kernel testing          | Y       | Intel kernel auto test framework                       |
     | Industrial talk: Taobao | Y       | TFS, Tair                                              |
     | Industrial talk: Baidu  | N       | The architecture of Baidu search system                |
     | Industrial talk: EMC    | N       | FSCK                                                   |
  6. Writeback - Wu Fengguang
     • vmscan is a bottleneck.
         • Decrease the dirty ratio under memory pressure, so that vmscan is less likely to find dirty pages during page allocation.
     • pageout(page) calls writepage() to write to disk, which is a performance killer since it does random writes.
         • Let the flusher write instead. Expand a single 4K write to a 4MB write, so more dirty pages are reclaimed and flushed.
     • balance_dirty_pages() should not write: random writes kill performance.
         • Let the flusher write and ask the process to sleep. Three proposals:
             a) Wait for IO completion: NFS has bumpy completion, so a smoother sleep method is needed.
             b) sleep(dirtied * 3/2 / write_bandwidth)
             c) sleep(dirtied / throttle_bandwidth)
     • Flusher default write size (4MB -> 128MB); it will be dynamic in the future.
         • Baidu's practice: SSD random write is really bad. For sequential writes, increasing the writeback size (4MB -> 40MB) gets 120% SSD performance.
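The two formula-based sleep proposals from the writeback slide, b) and c), can be compared with a little arithmetic. The sketch below is only an illustration: the slide gives no units, so bytes and bytes per second are assumed, and the function names and sample numbers are hypothetical.

```python
MB = 1024 * 1024


def pause_by_write_bandwidth(dirtied, write_bandwidth):
    """Proposal b): sleep for dirtied * 3/2 / write_bandwidth.

    Assumed units: dirtied in bytes, bandwidth in bytes per second.
    """
    return dirtied * 3 / 2 / write_bandwidth


def pause_by_throttle_bandwidth(dirtied, throttle_bandwidth):
    """Proposal c): sleep for dirtied / throttle_bandwidth (same assumed units)."""
    return dirtied / throttle_bandwidth


# Hypothetical numbers: a task dirtied 8 MB, the device writes 64 MB/s,
# and the task is to be throttled to 32 MB/s.
pause_b = pause_by_write_bandwidth(8 * MB, 64 * MB)
pause_c = pause_by_throttle_bandwidth(8 * MB, 32 * MB)
print(pause_b, pause_c)
```

In both variants the sleep grows linearly with the amount the task dirtied, so heavy writers are slowed down while the flusher does the actual IO.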
  7. Btrfs - Coly Li
     • Received more love from the Linux community two years ago than it does now
     • Used in the MeeGo project
     • Taobao plans to push industrial deployment in 2-3 years
         • 10T per data server in the TFS cluster
         • Used on hybrid SSD and SATA data servers
         • Metadata will be allocated on SSD and data on SATA.
     • Dynamic data relocation with the hot data tracking patch.
         • For generic fs usage, it needs to work with device mapper to get device speed information.
     • FSCK
         • Difficult, but a must. Currently assigned to Fujitsu.
  8. SSD challenges - Li Shaohua
     • Throughput: same issue as network
     • Disk controller gap and big locks (queue locks & SCSI locks)
     • Interrupt related:
         a) SMP affinity: single queue, one CPU to deal with IRQs
         b) blk_iopoll: poll more req in one req
     • Need hardware multiqueue
     • CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
     • Queue lock contention vs. cache lock contention
         • See Andi Kleen's talk at Tokyo Linux Conf
     • Intel is building a next-generation PCIe SSD with many fancy features. Stay tuned.
  9. VFS scalability - Ma Tao
     • With multi-cores, all global locks suck
     • The global icache/dcache can be adapted to per-CPU
     • CFQ can be adapted to per-queue
     • The fewer global locks, the better
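The idea of replacing one global lock with per-CPU (or per-queue) state can be shown with a sharded counter. This is a minimal user-space sketch, not kernel code: `ShardedCounter` and `run_workers` are hypothetical names, shards stand in for CPUs, and Python threads illustrate only the locking structure, not any real speedup.

```python
import threading


class ShardedCounter:
    """Counter split into shards, each guarded by its own lock."""

    def __init__(self, num_shards):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._counts = [0] * num_shards

    def add(self, shard, amount=1):
        # Only one shard's lock is taken, so writers on other shards
        # never contend with this update.
        index = shard % len(self._counts)
        with self._locks[index]:
            self._counts[index] += amount

    def total(self):
        # Readers pay the cost instead: they visit every shard to sum up.
        result = 0
        for index, lock in enumerate(self._locks):
            with lock:
                result += self._counts[index]
        return result


def run_workers(counter, num_workers, increments):
    """Run one thread per worker, each updating its own shard."""

    def work(worker_id):
        for _ in range(increments):
            counter.add(worker_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.total()
```

The trade-off is the usual one for per-CPU data: updates become cheap and local, while reading the global value requires walking all shards.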
  10. Industry talk - Baidu (Cont.)
     • Service types
         a) Indexing: random read, high IOPS, small IO size, read-only. 80M records per data node, processing 8-9K queries per second.
         b) Distributed system: large files, sequential read/write.
         c) Cache/KV storage: between a) and b)
         d) Web server: CPU bound.
     • For a), read() sucks.
         • Use mmap() to read blocks ahead and avoid kernel/userspace memory copies.
         • mmap() cannot use the page cache LRU. Call readahead() after each mmap() to mark pages as read.
         • An mmap() page fault is expensive because of the mm->mmap_sem lock. Use sync_readahead() and sync_readaheadv().
         • With the above, memory is now the bottleneck. Doing 10G+ MB read.
     • Google patch for reducing mm->mmap_sem hold time
         • In do_page_fault(), drop mm->mmap_sem if the page is not found, then read it and take the lock again.
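The mmap-plus-readahead pattern from the Baidu slide can be sketched from user space. This is an approximation under stated assumptions: Python does not expose `readahead()`, so `os.posix_fadvise()` with `POSIX_FADV_WILLNEED` is swapped in as the prefetch hint (and skipped where unavailable); `read_block` is a hypothetical helper; and the final slice copies bytes, so it does not reproduce the zero-copy benefit the slide describes, only the call sequence.

```python
import mmap
import os


def read_block(path, offset, length):
    """Map a file read-only, hint the kernel to prefetch, then read a block."""
    with open(path, "rb") as f:
        fd = f.fileno()
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            # Stand-in for readahead(): ask the kernel to pull the range
            # into the page cache before the mapping is touched.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            # Touching the mapping faults pages in; prefetched pages
            # are served from the page cache.
            return mapped[offset:offset + length]
```

Issuing the hint before touching the mapping is what lets the first access hit pages that are already in the page cache, instead of blocking inside the page fault path.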
  11. Industry talk - Baidu (Cont.)
     • 8K filesystem block size (ext2) with 8K page size.
         • Allocate two contiguous pages each time
         • Gets better performance for sequential IO
         • OCFS2 uses 1MB fs block size and 4K page size
     • PCIe compression card + ECC
  12. Industry talk - Taobao
     • TFS
     • Tair uses an update server to record updates and applies them to the production system at midnight
     • A config server is used to minimize meta server workload
         • A versioned bucket table is maintained by the config server and stored in each data server.
         • Clients can determine data locations with the bucket table returned by the config server.
     • Both TFS and Tair are open source projects now
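A client-side lookup against a versioned bucket table, as described on the Taobao slide, might look like the sketch below. Everything here is an assumption for illustration: the slide does not describe the actual TFS or Tair data structures, so `BucketTable`, `locate`, `refresh_if_stale`, and the CRC32-based hashing are all hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketTable:
    """Hypothetical bucket table: servers[i] names the data server for bucket i."""

    version: int
    servers: tuple


def locate(table, key):
    """Hash the key to a bucket and return the data server that owns it."""
    bucket = zlib.crc32(key.encode("utf-8")) % len(table.servers)
    return table.servers[bucket]


def refresh_if_stale(cached, latest):
    """Keep the cached table unless the config server has a newer version."""
    return latest if latest.version > cached.version else cached


# Hypothetical tables: version 2 moves two buckets to a new server "ds3".
v1 = BucketTable(version=1, servers=("ds1", "ds2", "ds1", "ds2"))
v2 = BucketTable(version=2, servers=("ds1", "ds2", "ds3", "ds3"))
current = refresh_if_stale(v1, v2)
print(current.version, locate(current, "user:42"))
```

Because the client resolves locations locally from its cached table, it only needs to contact the config server when the version changes, which is consistent with the slide's goal of reducing server workload.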
  13. Industry talk - EMC
     • Introduced the recovery and check methods used in file systems from ext2 to Btrfs
     • Emphasized the importance of FSCK; introduced the issues FSCK faces when checking a huge file system; collected proposals to solve this problem
     • pNFS learning notes
  14. Q&A