20101030 clsf2010



  1. CLSF 2010
  2. Agenda
     • Photos
     • Attendance overview
     • Key topics
     • Industry talks
  3. LSF 2010, Shanghai
  4. LSF 2010 participants/companies

     Company    # of participants  Key background
     Intel      5                  Kernel performance, SSD, memory management
     EMC        5                  Storage, file systems
     Fujitsu    4                  IO controller, btrfs
     Taobao     3                  Distributed storage, Taobao servers
     Novell     2                  SUSE server, HA
     Oracle     2                  OCFS2 dev/test
     Baidu      2                  Baidu kernel optimization
     Canonical  2
     Redhat     1                  Network drivers
  5. Key topics

     Topic                     Slides?  Description
     Kernel Tracing            N        ftrace
     Page writeback            Y        Dirty page ratio limit; control process to write pages
     CFQ, IO controller        Y        CFQ introduction and further features
     BTRFS                     N        Memory consumption, fsck speed
     SSD/Block layer           Y        Block layer issues with SSD
     VFS scalability           N        Multi-core challenges
     Kernel testing            Y        Intel kernel auto test framework
     Industrial talk: Taobao   Y        TFS, Tair
     Industrial talk: Baidu    N        The architecture of the Baidu search system
     Industrial talk: EMC      N        FSCK
  6. Writeback - Wu Fengguang
     • vmscan is a bottleneck: decrease the dirty ratio under memory pressure so that
       vmscan is less likely to find dirty pages during page allocation.
     • pageout(page) calls writepage() to write to disk, which is a performance killer
       since it does random writes. Let the flusher write instead, expanding a single
       4KB write into a 4MB write, so more dirty pages are reclaimed and flushed.
     • balance_dirty_pages() should not write: random writes kill performance. Let the
       flusher write and put the dirtying process to sleep. Three proposals:
       a) Wait for IO completion: NFS has bumpy completion, so a smoother sleep method
          is needed.
       b) sleep(dirtied * 3/2 / write_bandwidth)
       c) sleep(dirtied / throttle_bandwidth)
     • Flusher default write size (4MB -> 128MB); will be made dynamic in the future.
     • Baidu's practice: SSD random write is really bad. For sequential writes,
       increasing the writeback size (4MB -> 40MB) gets 120% SSD performance.
  7. Btrfs - Coly Li
     • Got much more love from the Linux community two years ago than it does now
     • Used in the MeeGo project
     • Taobao plans to push industrial deployment in 2-3 years: 10T per data server in
       the TFS cluster, on SSD+SATA hybrid data servers, with metadata allocated on SSD
       and data on SATA.
     • Dynamic data relocation with the hot-data-tracking patch. For generic fs usage,
       it needs to work with device mapper to get device speed information.
     • FSCK: a difficult must. Currently assigned to Fujitsu.
  8. SSD challenges - Li Shaohua
     • Throughput: same issue as networking
     • Disk controller gap and big locks (queue locks & SCSI locks)
     • Interrupt related:
       a) SMP affinity: with a single queue, one CPU deals with all the irqs
       b) blk_iopoll: poll more requests per interrupt
     • Need hardware multiqueue
     • CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
     • Queue lock contention vs. cache lock contention; see Andi Kleen's talk at the
       Tokyo Linux Conference
     • Intel is building next-gen PCIe SSDs with many fancy features. Stay tuned.
  9. VFS scalability - Ma Tao
     • With multiple cores, all global locks suck
     • The global icache/dcache can be adapted to per-CPU structures
     • CFQ can be adapted to per-queue
     • The fewer global locks, the better
  10. Industry talk - Baidu
      • Service types:
        a) Indexing: random read, high IOPS, small IO size, read-only. 80M records per
           data node, processing 8-9K queries per second.
        b) Distributed system: large files, sequential read/write.
        c) Cache/KV storage: between a and b.
        d) Web server: CPU bound.
      • For a), read() sucks. Use mmap() to read blocks ahead and avoid the
        kernel/userspace memory copy. mmap() cannot use the page cache LRU, so call
        readahead() after each mmap() to mark pages as read. mmap() page faults are
        expensive because of the mm->mmap_sem lock; use sync_readahead() and
        sync_readaheadv(). With the above, memory is now the bottleneck, doing
        10GB+ reads.
      • Google patch for reducing mm->mmap_sem hold time: in do_page_fault(), drop
        mm->mmap_sem if the page is not found, then read the page and take the lock
        again.
  11. Industry talk - Baidu (Cont.)
      • 8K filesystem block size (ext2) with 8K page size: allocate two contiguous
        pages each time; gets better performance for sequential IO. (OCFS2 uses a
        1MB fs block size with 4K page size.)
      • PCIe compression card + ECC
  12. Industry Talk - Taobao
      • TFS
      • Tair uses an update server to record updates; updates are applied to the
        production system during the night.
      • A config server is used to minimize meta-server workload: a versioned bucket
        table is maintained by the config server and stored on each data server.
        Clients locate data using the bucket table returned by the config server.
      • Both TFS and Tair are open-source projects now.
  13. Industry Talk - EMC
      • Introduced the recovery and checking methods used in file systems from ext2
        to btrfs
      • Emphasized the importance of FSCK; introduced the issues FSCK hits when
        checking a huge file system; collected proposals to solve this problem
      • pNFS learning notes
  14. Q & A