
20101030 clsf2010










    20101030 clsf2010 Presentation Transcript

    • CLSF 2010
    • Agenda
      • Photos
      • Attendance Overview
      • Key Topics
      • Industry Talks
    • LSF 2010, Shanghai
    • LSF 2010 participants by company
      Company   | # of participants | Key background
      Intel     | 5                 | Kernel performance, SSD, memory management
      EMC       | 5                 | Storage, file system
      Fujitsu   | 4                 | IO controller, btrfs
      Taobao    | 3                 | Distributed storage, Taobao server
      Novell    | 2                 | SUSE server, HA
      Oracle    | 2                 | OCFS2 dev/test
      Baidu     | 2                 | Baidu kernel optimization
      Canonical | 2                 |
      Red Hat   | 1                 | Network driver
    • Key topics
      Topic                   | Slides? | Description
      Page writeback          | Y       | Dirty page ratio limit; control process to write pages
      CFQ, IO controller      | Y       | CFQ introduction and further features
      BTRFS                   | N       | Memory consumption, fsck speed
      SSD/Block layer         | Y       | Block layer issues with SSD
      Kernel tracing          | N       | ftrace
      VFS scalability         | N       | Multi-core challenges
      Kernel testing          | Y       | Intel kernel auto test framework
      Industrial talk: Taobao | Y       | TFS, Tair
      Industrial talk: Baidu  | N       | The architecture of Baidu search system
      Industrial talk: EMC    | N       | FSCK
    • Writeback - Wu Fengguang
      • vmscan is a bottleneck: decrease the dirty ratio under memory pressure, so that vmscan is less likely to find dirty pages during page allocation.
      • pageout(page) calls writepage() to write to disk, which is a performance killer because it issues random writes. Let the flusher write instead, and expand a single 4K write into a 4MB write, so that more dirty pages are reclaimed and flushed.
      • balance_dirty_pages() should not write: random writes kill performance. Let the flusher write and ask the process to sleep. Three proposals:
        a) Wait for IO completion: NFS completion is bumpy, so a smoother sleep method is needed.
        b) sleep (dirtied * 3/2 / write_bandwidth)
        c) sleep (dirtied / throttle_bandwidth)
      • Flusher default write size (4MB -> 128MB); it will be dynamic in the future.
      • Baidu's practice: SSD random write is really bad. For sequential writes, increasing the writeback size (4MB -> 40MB) gets 120% SSD performance.
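The two sleep-based proposals (b) and (c) above are simple ratios, so a small worked example may help. The following is only an arithmetic sketch, not kernel code: the function names, the choice of bytes and bytes per second as units, and the sample numbers are all assumptions made for illustration.

```python
MB = 1024 * 1024


def pause_by_write_bandwidth(dirtied, write_bandwidth):
    """Proposal (b): sleep for dirtied * 3/2 / write_bandwidth.

    Units are an assumption: bytes and bytes/second, giving seconds.
    """
    if write_bandwidth <= 0:
        raise ValueError("write_bandwidth must be positive")
    return dirtied * 3 / 2 / write_bandwidth


def pause_by_throttle_bandwidth(dirtied, throttle_bandwidth):
    """Proposal (c): sleep for dirtied / throttle_bandwidth."""
    if throttle_bandwidth <= 0:
        raise ValueError("throttle_bandwidth must be positive")
    return dirtied / throttle_bandwidth


if __name__ == "__main__":
    # Hypothetical numbers: a task has dirtied 4 MB, the device sustains
    # 80 MB/s, and the task's throttle bandwidth is 20 MB/s.
    print(pause_by_write_bandwidth(4 * MB, 80 * MB))
    print(pause_by_throttle_bandwidth(4 * MB, 20 * MB))
```

In both cases the pause grows linearly with the amount of data the task dirtied, which is what lets the writer sleep in proportion to its own activity instead of issuing writes itself.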
    • Btrfs - Coly Li
      • Had more love from the Linux community two years ago than now
      • Used in the MeeGo project
      • Taobao plans to push industrial deployment in 2-3 years
        • 10T per data server in the TFS cluster
        • Use on hybrid SSD and SATA data servers: metadata will be allocated on SSD and data on SATA.
      • Dynamic data relocation with the hot data tracking patch. For generic fs usage, it needs to deal with device mapper to get device speed information.
      • FSCK: difficult, but a must. Currently assigned to Fujitsu.
    • SSD challenges - Li Shaohua
      • Throughput: same issue as network
      • Disk controller gap and big locks (queue locks & SCSI locks)
      • Interrupt related:
        a) smp affinity: single queue, one CPU to deal with irqs
        b) blk_iopoll: poll more req in one req
      • Need hardware multiqueue
      • CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
      • Queue lock contention vs. cache lock contention: see Andi Kleen's talk at Tokyo Linux Conf
      • Intel is building a next-generation PCIe SSD with many fancy features. Stay tuned.
    • VFS scalability - Ma Tao
      • With multi-core systems, all global locks hurt
      • The global icache/dcache can be adapted to per-CPU
      • CFQ can be adapted to per-queue
      • The fewer global locks, the better
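The idea of replacing one global lock with per-CPU (or per-queue) state can be illustrated with a userspace analogy. The sketch below is not how the kernel's icache or dcache work; `ShardedCounter` and `run_workers` are hypothetical names, and the shard index simply stands in for a CPU id.

```python
import threading


class ShardedCounter:
    """A counter split into shards, each guarded by its own lock."""

    def __init__(self, num_shards=4):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._counts = [0] * num_shards

    def add(self, shard, amount=1):
        # Writers touching different shards never contend on the same lock.
        i = shard % len(self._counts)
        with self._locks[i]:
            self._counts[i] += amount

    def total(self):
        # Readers pay the cost instead: they visit every shard.
        result = 0
        for i, lock in enumerate(self._locks):
            with lock:
                result += self._counts[i]
        return result


def run_workers(counter, num_workers, increments):
    """Start num_workers threads, each adding to 'its own' shard."""

    def work(worker_id):
        for _ in range(increments):
            counter.add(worker_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

The trade-off is the usual one for this pattern: updates become cheap and local, while a global view (here `total()`) has to walk all shards.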
    • Industry talk - Baidu (Cont.)
      • Service types
        a) Indexing: random read, high IOPS, small IO size, read-only. 80M records per data node, processing 8-9K queries per second.
        b) Distributed system: large files, sequential read/write.
        c) Cache/KV storage: between a) and b).
        d) Web server: CPU bound.
      • For a), read() performs poorly.
        • Use mmap() to read blocks ahead and avoid the kernel/userspace memory copy.
        • mmap() cannot use the page cache LRU. Call readahead() after each mmap() to mark pages as read.
        • An mmap() page fault is expensive because of the mm->mmap_sem lock. Use sync_readahead() and sync_readaheadv().
        • With the above, memory is now the bottleneck. Doing 10G+ MB read.
      • Google patch for reducing mm->mmap_sem hold time: in do_page_fault(), drop mm->mmap_sem if the page is not found, then read it and take the lock again.
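The mmap()-plus-readahead pattern described for the indexing workload can be sketched in userspace. This is a minimal illustration under stated assumptions, not Baidu's implementation: `read_block_via_mmap` is a hypothetical helper, and because Python's standard library does not expose readahead(), the sketch substitutes `madvise(MADV_WILLNEED)` (Python 3.8+, where the platform supports it) as the prefetch hint.

```python
import mmap
import os
import tempfile


def read_block_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` by mapping the file instead of read()."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0 or offset >= size:
            return b""
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                # Prefetch hint, standing in for readahead(). The start of the
                # advised range must be page-aligned.
                start = offset - (offset % mmap.PAGESIZE)
                end = min(offset + length, size)
                mm.madvise(mmap.MADV_WILLNEED, start, end - start)
            # Slicing copies into a bytes object for simplicity; a real
            # consumer would process the mapping in place to avoid the copy.
            return mm[offset:offset + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 64
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        print(read_block_via_mmap(tmp.name, 5000, 4) == data[5000:5004])
    finally:
        os.unlink(tmp.name)
```

The hint is advisory: the kernel may or may not prefetch the range, so the function returns the same bytes either way and only the page-fault behaviour differs.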
    • Industry talk - Baidu (Cont.)
      • 8K filesystem block size (ext2) with 8K page size
        • Allocate two contiguous pages each time
        • Get better performance for sequential IO
        • OCFS2 uses a 1MB fs block size and a 4K page size
      • PCIe compression card + ECC
    • Industry Talk - Taobao
      • TFS
      • Tair uses an update server to record updates; updates are applied to the production system around midnight
      • A config server is used to minimize the meta server workload
        • A versioned bucket table is maintained by the config server and stored in each data server.
        • The client can work out data locations from the bucket table returned by the config server.
      • Both TFS and Tair are open source projects now
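The versioned bucket table can be pictured with a toy model. The slide does not describe Tair's actual protocol, so everything here is an assumption made for illustration: the `ConfigServer` and `Client` classes, the CRC32-based bucket choice, and the rule that a client refetches the table when it sees a different version number.

```python
import zlib


class ConfigServer:
    """Hypothetical config server owning a versioned bucket -> server table."""

    def __init__(self, data_servers, num_buckets=8):
        self.version = 1
        self.table = [data_servers[i % len(data_servers)] for i in range(num_buckets)]

    def get_table(self):
        # Hand out a copy so clients hold a snapshot tied to a version.
        return self.version, list(self.table)

    def reassign(self, bucket, server):
        # Any change to the table bumps the version.
        self.table[bucket] = server
        self.version += 1


class Client:
    """Hypothetical client that locates data without asking a meta server."""

    def __init__(self, config_server):
        self.config_server = config_server
        self.version, self.table = config_server.get_table()

    def locate(self, key):
        # Hash the key to a bucket, then look the bucket up locally.
        bucket = zlib.crc32(key.encode("utf-8")) % len(self.table)
        return bucket, self.table[bucket]

    def refresh_if_stale(self, seen_version):
        # A version mismatch (for example, reported by a data server)
        # triggers a refetch of the table.
        if seen_version != self.version:
            self.version, self.table = self.config_server.get_table()
            return True
        return False
```

In this model a lookup costs one hash and one list index on the client, and the config server is contacted only when the table version changes, which matches the slide's point about keeping load off the central server.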
    • Industry Talk - EMC
      • Introduced the recovery and check methods used in file systems from ext2 to btrfs
      • Emphasized the importance of FSCK; introduced the issues FSCK faces when checking a huge file system; collected proposals to solve this problem
      • pNFS learning notes
    • Q&A