20101030 clsf2010


    20101030 clsf2010 Presentation Transcript

    • CLSF 2010
    • Agenda
      • Photos
      • Attendance Overview
      • Key Topics
      • Industry Talks
    • LSF 2010, Shanghai
    • LSF 2010 participants/companies
      Company     # of participants   Key background
      Intel       5                   Kernel performance, SSD, mem mgmt
      EMC         5                   Storage, file system
      Fujitsu     4                   IO Controller, btrfs
      Taobao      3                   Distributed storage, taobao server
      Novell      2                   Suse server, HA
      Oracle      2                   OCFS2 dev/test
      Baidu       2                   Baidu kernel optimization
      Canonical   2
      Redhat      1                   Network driver
    • Key topics
      Topic                     Slides?   Description
      Page writeback            Y         Dirty page ratio limit; control process to write pages
      CFQ, IO controller        Y         CFQ introduction and further features
      BTRFS                     N         Memory consuming, fsck speed
      SSD/Block layer           Y         Block layer issues with SSD
      Kernel Tracing            N         ftrace
      VFS scalability           N         Multi-core challenges
      Kernel testing            Y         Intel kernel auto test framework
      Industrial talk: Taobao   Y         TFS, Tair
      Industrial talk: Baidu    N         The architecture of Baidu search system
      Industrial talk: EMC      N         FSCK
    • Writeback - Wu Fengguang
      • vmscan is a bottleneck
        Decrease the dirty ratio under memory pressure, so that vmscan is less likely to find dirty pages during page allocation.
      • pageout(page) calls writepage() to write to disk, which is a performance killer since it does random writes
        Let the flusher write. Expand a single 4K write to a 4MB write, so more dirty pages are reclaimed and flushed.
      • balance_dirty_pages() should not write: random write kills performance
        Let the flusher write and ask the process to sleep. Three proposals:
        a) Wait for IO completion: NFS completion is bumpy, so a smoother sleep method is needed.
        b) sleep (dirtied * 3/2 / write_bandwidth)
        c) sleep (dirtied / throttle_bandwidth)
      • Flusher default write size (4MB -> 128MB), will be dynamic in the future.
        Baidu's practice: SSD random write is really bad. For sequential write, increasing the writeback size (4MB -> 40MB) yields 120% SSD performance.
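Proposals b) and c) on the writeback slide are plain arithmetic on how much data a task has dirtied. The following is a minimal sketch of that arithmetic, assuming `dirtied` is in bytes and both bandwidths are in bytes per second (the slide does not state units). The function names are hypothetical and this is not the kernel's implementation.

```python
def sleep_by_write_bandwidth(dirtied_bytes: float, write_bandwidth: float) -> float:
    """Proposal b): sleep for dirtied * 3/2 / write_bandwidth seconds."""
    if write_bandwidth <= 0:
        raise ValueError("write_bandwidth must be positive")
    return dirtied_bytes * 3 / 2 / write_bandwidth


def sleep_by_throttle_bandwidth(dirtied_bytes: float, throttle_bandwidth: float) -> float:
    """Proposal c): sleep for dirtied / throttle_bandwidth seconds."""
    if throttle_bandwidth <= 0:
        raise ValueError("throttle_bandwidth must be positive")
    return dirtied_bytes / throttle_bandwidth


if __name__ == "__main__":
    MB = 1024 * 1024
    # Assumed example figures: a task dirtied 4 MB of page cache.
    print(sleep_by_write_bandwidth(4 * MB, 80 * MB))     # device writes back at 80 MB/s
    print(sleep_by_throttle_bandwidth(4 * MB, 40 * MB))  # task throttled to 40 MB/s
```

In both variants the sleep grows linearly with the amount dirtied, which is what lets the flusher, rather than the dirtying process, issue the writes.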
    • Btrfs - Coly Li
      • Drew a great deal of enthusiasm from the Linux community; more two years ago than now
      • Used in the MeeGo Project
      • Taobao plans to push industrial deployment in 2-3 years
        10T per data server in the TFS cluster
        Use on SSD and SATA hybrid data servers
        Metadata will be allocated on SSD and data on SATA.
      • Dynamic data relocation with the hot data tracking patch. For generic fs usage, need to deal with device mapper to get device speed information.
      • FSCK
        A difficult must. Currently assigned to Fujitsu.
    • SSD challenges - Li Shaohua
      • Throughput: same issue as network
      • Disk controller gap and big locks (queue locks & scsi locks)
      • Interrupt related:
        a. smp affinity: single queue, one CPU to deal with irqs
        b. blk_iopoll: poll more req in one req
      • Need hardware multiqueue
      • CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
      • Queue lock contention vs. cache lock contention
        See Andi Kleen's talk in Tokyo Linux Conf
      • Intel is building nextGen PCIE SSD, with many fancy features. Stay tuned
    • VFS scalability - Ma Tao
      • With multi-cores, all global locks suck
      • Global icache/dcache can be adapted to per-CPU
      • CFQ can be adapted to per-queue
      • The fewer global locks the better
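The idea of replacing one global lock with per-CPU (or per-queue) state can be illustrated with a sharded counter. This is only a user-space sketch of the pattern, not kernel code: `ShardedCounter` and `run_workers` are hypothetical names, and a shard chosen by worker id stands in for "the current CPU".

```python
import threading


class ShardedCounter:
    """Counter split into shards, each protected by its own lock.

    Writers touch only their own shard, so they do not all contend
    on a single global lock. Reading the total is the slow path.
    """

    def __init__(self, num_shards: int = 8) -> None:
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._counts = [0] * num_shards

    def add(self, shard_hint: int, amount: int = 1) -> None:
        idx = shard_hint % len(self._counts)
        with self._locks[idx]:
            self._counts[idx] += amount

    def total(self) -> int:
        # Visit every shard, taking each shard lock in turn.
        result = 0
        for idx, lock in enumerate(self._locks):
            with lock:
                result += self._counts[idx]
        return result


def run_workers(counter: ShardedCounter, num_threads: int = 4, increments: int = 1000) -> int:
    """Start num_threads workers, each updating its own shard, and return the total."""

    def work(worker_id: int) -> None:
        for _ in range(increments):
            counter.add(worker_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()


if __name__ == "__main__":
    print(run_workers(ShardedCounter(num_shards=4)))
```

The trade-off is the usual one for per-CPU data: updates become cheap and uncontended, while an exact global view requires visiting every shard.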
    • Industry talk – Baidu (Cont.)
      • Service types
        a) Indexing: random read, high IOPS, small IO size, read-only. 80M records per data node, processing 8-9K queries per second.
        b) Distributed system: large files, sequential read/write.
        c) Cache/KV storage: between a) and b)
        d) Web server: CPU bound.
      • For a), read() sucks.
        Use mmap() to read blocks ahead to avoid the kernel/userspace memory copy.
        mmap() cannot use the page cache LRU. Call readahead() after each mmap() to mark pages as read.
        An mmap() page fault is expensive because of the mm->mmap_sem lock. Use sync_readahead() and sync_readaheadv()
        With the above, memory is now the bottleneck. Doing 10G+ MB read.
      • Google patch for reducing mm->mmap_sem hold time
        In do_page_fault(), drop mm->mmap_sem if the page is not found, then read it and take the lock again.
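The mmap-plus-readahead approach described for the indexing workload can be sketched in user space as follows. Python does not expose the Linux readahead() call directly, so this sketch substitutes `os.posix_fadvise` with `POSIX_FADV_WILLNEED` as a prefetch hint; that substitution, and the helper name `read_block_mmap`, are assumptions made for illustration. The sync_readahead() and sync_readaheadv() calls named on the slide are not used here.

```python
import mmap
import os
import tempfile


def read_block_mmap(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory mapping."""
    with open(path, "rb") as f:
        fd = f.fileno()
        # Prefetch hint, standing in for readahead(); skipped where unavailable.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies into a bytes object for convenience; a caller
            # that wants to avoid the copy would work on the mapping itself.
            return mm[offset:offset + length]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 64)  # 16 KiB of sample data
        sample_path = tmp.name
    try:
        block = read_block_mmap(sample_path, 8192, 4096)
        print(len(block))
    finally:
        os.unlink(sample_path)
```

Issuing the hint before touching the mapping gives the kernel a chance to bring the pages in ahead of the first access, so fewer accesses have to wait in the page fault path.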
    • Industry talk – Baidu (Cont.)
      • 8K filesystem block size (ext2) with 8K page size.
        Allocate two contiguous pages each time
        Get better performance for sequential IO
        OCFS2 uses 1MB fs block size and 4K page size
      • PCIE compress card + ECC
    • Industry Talk - Taobao
      • TFS
      • Tair uses an update server to record updates; apply updates to production system during midnight
      • A config server is used to minimize meta server workload
        A versioned bucket table is maintained by the config server and stored in each data server.
        The client can manipulate data location with the bucket table returned by the config server.
      • Both TFS and Tair are open source projects now
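The versioned bucket table can be pictured as a small lookup structure that the client caches. The sketch below is an illustration only: the slide does not describe Tair's hashing scheme or table layout, so the CRC32-modulo mapping, the class names `BucketTable` and `Client`, and the server names are all assumptions.

```python
import zlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BucketTable:
    """Hypothetical bucket table: servers[i] owns bucket i."""
    version: int
    servers: Tuple[str, ...]

    def locate(self, key: bytes) -> str:
        # Assumed mapping: CRC32 of the key, modulo the number of buckets.
        bucket = zlib.crc32(key) % len(self.servers)
        return self.servers[bucket]


class Client:
    """Caches a bucket table and routes keys without asking a meta server."""

    def __init__(self, table: BucketTable) -> None:
        self.table = table

    def refresh(self, candidate: BucketTable) -> bool:
        """Adopt a table from the config server only if its version is newer."""
        if candidate.version > self.table.version:
            self.table = candidate
            return True
        return False


if __name__ == "__main__":
    client = Client(BucketTable(version=1, servers=("ds-a", "ds-b", "ds-c")))
    print(client.table.locate(b"user:42"))
    client.refresh(BucketTable(version=2, servers=("ds-a", "ds-b", "ds-c", "ds-d")))
    print(client.table.version)
```

Because the client resolves locations locally from its cached table, the config server only needs to be contacted when the version changes.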
    • Industry Talk - EMC
      • Introduced the recovery and check methods used in file systems from ext2 to Btrfs
      • Emphasized the importance of FSCK; introduced the issues FSCK faces when checking a huge file system; collected proposals to solve this problem
      • pNFS learning notes
    • Q&A