Exploiting Multicore CPUs Now: Scalability and Reliability for Off-the-shelf Software
Multiple core CPUs are here. Conventional wisdom holds that, to take best advantage of these processors, we now need to rewrite sequential applications to make them multithreaded. Because of the difficulty of programming correct and efficient multithreaded applications (e.g., race conditions, deadlocks, and scalability bottlenecks), this is a major challenge.

This talk presents two alternative approaches that bring the power of multiple cores to today's software. The first approach focuses on building highly-concurrent client-server applications from legacy code. I present a system called Flux that allows users to take unmodified off-the-shelf *sequential* C and C++ code and build concurrent applications. The Flux compiler combines the Flux program and the sequential code to generate a deadlock-free, high-concurrency server. Flux also generates discrete event simulators that accurately predict actual server performance under load. While the Flux language was initially targeted at servers, we have found it to be a useful abstraction for sensor networks, and I will briefly talk about our use of an energy-aware variant of Flux in a deployment on the backs of endangered turtles. The second approach uses the extra processing power of multicore CPUs to make legacy C/C++ applications more reliable. I present a system called DieHard that uses randomization and replication to transparently harden programs against a wide range of errors, including buffer overflows and dangling pointers. Instead of crashing or running amok, DieHard lets programs continue to run correctly in the face of memory errors with high probability. This is joint work with Brendan Burns, Kevin Grimaldi, Alex Kostadinov, Jacob Sorber, and Mark Corner (University of Massachusetts Amherst), and Ben Zorn (Microsoft Research).

Presentation Transcript

  • Exploiting Multicore CPUs Now: Scalability and Reliability for Off-the-shelf Software Emery Berger University of Massachusetts Amherst
  • Research Overview
    • High-performance memory managers
      • Hoard allocator for concurrent apps [ASPLOS-IX]
      • Heap Layers infrastructure [PLDI 01]
      • Reaps (regions + heaps) [OOPSLA 02]
    • Cooperative memory management (OS + GC)
      • Bookmarking: GC without paging [PLDI 04]
      • CRAMM VM + any GC, max throughput [ISMM 04, OSDI 06]
    • And:
      • Memory management studies
        • Custom allocation [OOPSLA 02], GC vs. malloc [OOPSLA 05]
      • Support for contributory applications
        • Transparent contribution: memory, disk [USENIX 06, FAST 07]
      • Plus other compiler & runtime stuff
    Transparently improving performance, robustness & reliability (PL + OS)
  • Concurrent Memory Allocators
    • Previous allocators unsuitable for multithreaded apps
      • Serialized heap
        • Protected by lock
      • Allocator-induced false sharing
      • Poor space bounds: blowup
        • O(P), O(T), or unbounded increase in memory
    [Diagram: a malloc/free sequence (x1, x2, x3) across processors 0 and 1 under “pure private heaps” (STL, Cilk, others); freed memory accumulates on heap 1 while processor 0 keeps allocating, illustrating blowup]
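    A hypothetical C sketch (not from the talk) of allocator-induced false sharing: each thread updates only its own counter, but if a serializing allocator places the two small objects on the same cache line, the threads contend for that line anyway. Compile with -lpthread.
      /* Two logically private counters that a serializing allocator may
       * place on one cache line, causing false sharing. */
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      enum { ITERS = 100000000 };

      static void *worker(void *arg) {
          volatile long *counter = arg;
          for (long i = 0; i < ITERS; i++)
              (*counter)++;                 /* write traffic to "private" data */
          return NULL;
      }

      int main(void) {
          /* Back-to-back small allocations often end up adjacent in memory. */
          long *a = malloc(sizeof *a);
          long *b = malloc(sizeof *b);
          *a = *b = 0;

          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, a);
          pthread_create(&t2, NULL, worker, b);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);

          printf("a=%ld b=%ld (at %p and %p)\n", *a, *b, (void *)a, (void *)b);
          free(a);
          free(b);
          return 0;
      }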
  • Hoard Memory Allocator
    • Hoard
      • Scalable heap
        • Provably low synch overhead
      • Optimal space consumption: blowup = O(1)
      • Avoids false sharing
    • www.hoard.org
      • 40,000+ downloads
      • AOL, BT, Philips, Credit Suisse, Novell, etc.
  • The Cores Have Arrived
    • Hurray! Now what?
    • Multithreading problems:
      • Data races
      • Deadlock & livelock
      • Scalability bottlenecks
    • Automatic Parallelization?
  • Exploit Multicores Now!
    • Taking advantage of multicores without rewriting a line of code:
      • Build scalable applications from parts
        • Flux: “glue” language for easily building highly-concurrent servers [USENIX 06]
      • Increase reliability
        • DieHard: lets C/C++ programs run correctly in face of memory errors with high probability [PLDI 06]
  • Flux: A Language for Programming High-Performance Servers. Joint work with Brendan Burns, Kevin Grimaldi, Alex Kostadinov, and Mark Corner (University of Massachusetts Amherst)
  • Motivating Example: Image Server
    • Client
      • Requests image @ desired quality, size
    • Server
      • Images: RAW
      • Compresses to JPG
      • Caches requests
      • Sends to client
    [Diagram: client requests http://server/Easter-bunny/200x100/75 from the image server (“not found” case shown)]
  • Problem: Concurrency
    • Could write sequential code but…
      • More clients (latency)
      • Bigger server
        • Multicores, multiprocessors
    • One approach: threads
      • Risk deadlock, etc.
      • Mixes program logic & concurrency control – ties to runtime (threads?!)
  • The Flux Programming Language
    • Unmodified C, C++ (or Java) – black boxes
    • Compose with Flux program
      • Assume #clients » #cores
    • High-quality server + performance tools:
      • Statically enforces atomicity w/o deadlock
      • Path profiling
      • Discrete event simulator
    High-performance & deadlock-free concurrent programming w/ sequential components
  • Flux Server “Main”
    • Source nodes originate flows
      • Conceptually in separate thread
      • Executes inside implicit infinite loop
        • Initiates flow (“thread”) for each image request
    source Listen → Image;
    [Diagram: the Listen source spawns a ReadRequest → Compress → Write → Complete flow for each image request]
  • Flux Image Server
    • Basic image server requires:
      • HTTP parsing (http)
      • Socket handling (socket)
      • JPEG compression (libjpeg)
      • All UNIX-style C libraries
    • Abstract node = flow across nodes
      • Concrete or abstract
    Image = ReadRequest → Compress → Write → Complete;
    [Diagram: ReadRequest (http) → Compress (libjpeg) → Write (socket) → Complete (http) inside the image server]
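    To make the "black box" idea concrete, a hypothetical C sketch (illustrative names and signatures, not the actual Flux node interface or generated code): each concrete node is an ordinary sequential C function over per-request state, and the Image flow is simply their composition.
      #include <stdio.h>
      #include <string.h>

      typedef struct {
          char path[64];
          int  quality;
          char image[80];        /* stands in for the compressed bytes */
      } request_t;

      static int ReadRequest(request_t *r) { strcpy(r->path, "/Easter-bunny"); r->quality = 75; return 0; }
      static int Compress(request_t *r)    { snprintf(r->image, sizeof r->image, "JPEG(%s, q=%d)", r->path, r->quality); return 0; }
      static int Write(request_t *r)       { printf("sending %s\n", r->image); return 0; }
      static int Complete(request_t *r)    { (void)r; return 0; }

      /* Image = ReadRequest -> Compress -> Write -> Complete */
      static int Image(request_t *r) {
          return ReadRequest(r) || Compress(r) || Write(r) || Complete(r);
      }

      int main(void) {
          request_t r = {0};
          return Image(&r);      /* one flow; the runtime would run many concurrently */
      }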
  • Control Flow
    • Direct flow via user-supplied predicate types
      • Type test applied to output
        • Note: no variables – dispatch on output “type”
      • Here: cache frequently requested images
    Image = ReadRequest → Handler → Write → Complete;
    typedef hit TestInCache;
    Handler:[_,_,hit] = ;
    Handler:[_,_,_] = ReadFromDisk → Compress → StoreInCache;
    [Diagram: on a cache hit, Handler passes straight through to Write; on a miss it runs ReadInFromDisk → Compress → StoreInCache]
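    A hypothetical C sketch of a predicate type (illustrative only, not the real Flux interface): a user-supplied boolean function inspects the flow's data, and the runtime picks the matching Handler variant.
      #include <stdio.h>

      typedef struct { const char *path; int cached; } request_t;

      /* typedef hit TestInCache;  -- the predicate behind the "hit" type */
      static int TestInCache(const request_t *r) { return r->cached; }

      static void Handler(request_t *r) {
          if (TestInCache(r))
              printf("%s: hit, go straight to Write\n", r->path);          /* Handler:[_,_,hit] = ; */
          else
              printf("%s: miss, ReadFromDisk -> Compress -> StoreInCache\n", r->path);
      }

      int main(void) {
          request_t hit  = { "/Easter-bunny", 1 };
          request_t miss = { "/Easter-bunny", 0 };
          Handler(&hit);
          Handler(&miss);
          return 0;
      }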
  • Supporting Concurrency
    • Many clients = concurrent flows
      • Must keep cache consistent
    • Atomicity constraints
      • Same name = mutual exclusion (2PL)
      • Apply to nodes or whole flow (abstract node)
    atomic CheckCache { … }; atomic Complete { …, … }; atomic StoreInCache { … };
    [Diagram: many concurrent flows through the cached image server; nodes that share a constraint name run mutually exclusively]
  • More Atomicity
    • Reader / writer constraints
      • Multiple readers or single writer (default)
        • atomic ReadList: {listAccess?};
        • atomic AddToList: {listAccess!};
    • Per-session constraints
      • User-supplied function ≈ hash on source
        • Added to flow ≈ chooses from array of locks
    atomic AddHasChunk: {chunks(session)};
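    One plausible mapping of these constraints onto a POSIX runtime, as a hypothetical sketch (not the actual Flux-generated code): the constraint name becomes an rwlock, "?" takes it shared, "!" takes it exclusive. Compile with -lpthread.
      #include <pthread.h>
      #include <stdio.h>

      static pthread_rwlock_t listAccess = PTHREAD_RWLOCK_INITIALIZER;
      static int list_len = 0;

      static void ReadList(void) {              /* atomic ReadList: {listAccess?}; */
          pthread_rwlock_rdlock(&listAccess);   /* shared: many readers at once   */
          printf("list has %d entries\n", list_len);
          pthread_rwlock_unlock(&listAccess);
      }

      static void AddToList(void) {             /* atomic AddToList: {listAccess!}; */
          pthread_rwlock_wrlock(&listAccess);   /* exclusive: single writer         */
          list_len++;
          pthread_rwlock_unlock(&listAccess);
      }

      int main(void) {
          AddToList();
          ReadList();
          return 0;
      }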
  • Preventing Deadlock
    • Naïve execution can deadlock
    • Establish canonical lock order
      • Partial order
      • Alphabetic by name
    Before: atomic A: {z,y}; atomic B: {y,z};
    After: atomic A: {y,z}; atomic B: {y,z};
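    A minimal C sketch of canonical lock ordering (illustrative, not the generated code): however a constraint lists its locks, they are always acquired in one global alphabetical order, so A and B can never wait on each other in a cycle. Compile with -lpthread.
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t lock_z = PTHREAD_MUTEX_INITIALIZER;

      /* atomic A: {z,y};  -- reordered to the canonical {y,z} */
      static void A(void) {
          pthread_mutex_lock(&lock_y);
          pthread_mutex_lock(&lock_z);
          puts("in A");
          pthread_mutex_unlock(&lock_z);
          pthread_mutex_unlock(&lock_y);
      }

      /* atomic B: {y,z};  -- already canonical */
      static void B(void) {
          pthread_mutex_lock(&lock_y);
          pthread_mutex_lock(&lock_z);
          puts("in B");
          pthread_mutex_unlock(&lock_z);
          pthread_mutex_unlock(&lock_y);
      }

      int main(void) { A(); B(); return 0; }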
  • Preventing Deadlock, II
    • Harder with abstract nodes
    Before: A = B; C = D; atomic A:{z}; atomic B:{y}; atomic C:{y,z};
    After: A = B; C = D; atomic A:{y,z}; atomic B:{y}; atomic C:{y,z};
    • Solution: Elevate constraints; fixed point
    [Diagram: constraints elevated from constituent nodes to abstract nodes (A:{z} becomes A:{y,z}) until a fixed point is reached]
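    A small hypothetical sketch of the fixed-point elevation in C (the real compiler works over the Flux program graph; the data here is illustrative): constraint sets, held as bitmasks, are folded from constituent nodes into the abstract nodes that contain them until nothing changes.
      #include <stdio.h>

      enum { Y = 1 << 0, Z = 1 << 1 };

      #define NNODES 3
      static const char *name[NNODES]  = { "A", "B", "C" };
      static unsigned    locks[NNODES] = { Z, Y, Y | Z };  /* A:{z} B:{y} C:{y,z} */
      static const int   child[NNODES] = { 1, -1, -1 };    /* A contains B; B, C are leaves */

      int main(void) {
          int changed = 1;
          while (changed) {                    /* iterate until a fixed point */
              changed = 0;
              for (int i = 0; i < NNODES; i++) {
                  if (child[i] < 0) continue;
                  unsigned merged = locks[i] | locks[child[i]];
                  if (merged != locks[i]) { locks[i] = merged; changed = 1; }
              }
          }
          for (int i = 0; i < NNODES; i++)
              printf("%s: {%s%s}\n", name[i],
                     (locks[i] & Y) ? "y" : "", (locks[i] & Z) ? "z" : "");
          return 0;
      }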
  • Almost Complete Flux Image Server
    • Concise, readable expression of server logic
      • No threads, etc.: simplifies programming, debugging
    source Listen → Image;
    Image = ReadRequest → CheckCache → Handler → Write → Complete;
    Handler[_,_,hit] = ;
    Handler[_,_,_] = ReadFromDisk → Compress → StoreInCache;
    atomic CheckCache: {cacheLock};
    atomic StoreInCache: {cacheLock};
    atomic Complete: {cacheLock};
    handle error ReadInFromDisk → FourOhFour;
    [Diagram: the full image-server flow graph, with cache hit and miss paths]
  • Flux Outline
    • Intro to Flux: building a server
      • Components, flow
      • Atomicity, deadlock avoidance
    • Performance results
      • Server performance
      • Performance prediction
    • Future work
  • Flux Results
    • Four servers:
      • Image server [23]
        • + libjpeg
      • Multi-player game [54]
      • BitTorrent [84]
        • 2 undergrads: 1 week!
      • Web server [36]
        • + PHP
    • Evaluation
      • Benchmark: variant of SPECweb99
      • Compared to Capriccio [SOSP03], SEDA [SOSP01]
    [Diagram: the same cached image-server flow compiled to thread-per-connection, event-driven, and thread-pool runtimes]
  • Web Server
  • Performance Prediction (observed parameters)
  • Flux Conclusion
    • Flux language & system
      • Concurrency made easier
      • Build high-performance servers from sequential parts
        • Deadlock-free
      • Predict & debug performance before deployment
  • Future Work: eFlux
    • Wood turtle (Clemmys insculpta)
    • eFlux : language for perpetual computing
      • Sensors ≈ client-server!
      • Energy-aware language
        • Flows decorated with power states (e.g., “high”, “low”)
        • Provide different levels of service depending on available & predicted energy
  • DieHard: Probabilistic Memory Safety for Unsafe Programming Languages. Joint work with Ben Zorn (Microsoft Research)
  • Problems with Unsafe Languages
    • C, C++: pervasive apps, but the languages are memory-unsafe
    • Numerous opportunities for security vulnerabilities, errors
      • Double free
      • Invalid free
      • Uninitialized reads
      • Dangling pointers
      • Buffer overflows (stack & heap)
  • Current Approaches
    • Unsound, may work or abort
      • Windows, GNU libc, etc., Rx [Zhou]
    • Unsound, will definitely continue
      • Failure oblivious [Rinard]
    • Sound, definitely aborts (fail-safe)
      • CCured [Necula], CRED [Ruwase & Lam], SAFECode [Dhurjati, Kowshik & Adve], &c.
        • Slowdowns: 30% - 20X
        • Requires C source, programmer intervention
        • Garbage collection or partially sound (pools)
      • Good for debugging, less so for deployment
  • Soundness for “Erroneous” Programs
    • Normally: memory errors ⇒ undefined behavior…
    • Consider infinite-heap allocator:
      • Every new is fresh; every delete ignored
        • No dangling pointers, invalid frees, double frees
      • Every object infinitely large
        • No buffer overflows, data overwrites
    • Transparent to correct program
    • “Erroneous” programs become sound
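    A toy C sketch of the infinite-heap idea (illustrative only, not DieHard): pad every allocation generously and ignore free, so dangling pointers, double frees, and modest overflows become harmless, at the cost of unbounded memory growth.
      #include <stddef.h>
      #include <stdlib.h>
      #include <string.h>

      enum { PAD = 4096 };            /* arbitrary padding; a true infinite heap has no limit */

      static void *inf_malloc(size_t sz) {
          return calloc(1, sz + PAD); /* fresh, zeroed, padded memory every time */
      }

      static void inf_free(void *p) {
          (void)p;                    /* ignore free/delete entirely */
      }

      int main(void) {
          char *s = inf_malloc(4);
          strcpy(s, "overflow!");     /* 10 bytes into a 4-byte request: absorbed by padding */
          inf_free(s);
          inf_free(s);                /* double free: ignored */
          return 0;
      }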
  • Probabilistic Memory Safety
    • Approximate the infinite heap with M-heaps (e.g., M = 2)
    • Naïve: pad allocations, defer deallocations
      • No protection from larger overflows
        • pad = 8 bytes, overflow = 9 bytes…
      • Deterministic: overflow crashes everyone
    • DieHard: randomize the M-heap
      • Probabilistic memory safety
        • Independent across heaps
      • Efficient implementation…
  • Implementation Choices
    • Conventional, freelist-based heaps
      • Hard to randomize, protect from errors
        • Double frees, heap corruption
    • What about bitmaps? [Wilson90]
      • Catastrophic fragmentation
        • Each small object likely to occupy one page
    [Diagram: each small object occupies its own page]
  • Randomized Heap Layout
    • Bitmap-based, segregated size classes
      • Bit represents one object of given size
        • i.e., one bit in size class i = one object of 2^(i+3) bytes
      • Prevents fragmentation
    [Diagram: metadata holds one bitmap per size class (2^(i+3), 2^(i+4), 2^(i+5) bytes); the heap holds the objects]
  • Randomized Allocation
    • malloc(8) :
      • compute size class = ceil(log2(sz)) - 3
      • randomly probe bitmap for zero-bit (free)
    • Fast: runtime O(1)
      • M = 2 ⇒ E[# of probes] ≤ 2
    [Diagram: the 2^(i+3) size-class bitmap after the allocation, with one randomly chosen bit now set]
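    A toy C sketch of this allocation path (data structures and sizes are illustrative, not DieHard's actual code): compute the size class, then probe that class's bitmap at random until a free slot turns up; while the class is at most half full, the expected number of probes stays constant.
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      enum { SLOTS = 1024 };              /* slots per size class (toy value) */

      typedef struct {
          uint8_t bitmap[SLOTS];          /* 1 = in use, 0 = free             */
          char   *slab;                   /* SLOTS objects of this class size */
          size_t  obj_size;
      } size_class_t;

      static int size_class(size_t sz) {  /* class = ceil(log2(sz)) - 3 */
          int c = 0;
          size_t s = 8;                   /* smallest class: 2^3 bytes  */
          while (s < sz) { s <<= 1; c++; }
          return c;
      }

      static void *rand_alloc(size_class_t *sc) {
          for (;;) {
              int i = rand() % SLOTS;     /* random probe into the bitmap */
              if (!sc->bitmap[i]) {
                  sc->bitmap[i] = 1;
                  return sc->slab + i * sc->obj_size;
              }
          }
      }

      int main(void) {
          size_class_t sc = { {0}, malloc(SLOTS * 8), 8 };
          printf("class for malloc(8): %d\n", size_class(8));
          void *p = rand_alloc(&sc);
          printf("allocated at slot offset %td\n", (char *)p - sc.slab);
          free(sc.slab);
          return 0;
      }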
  • Randomized Deallocation
    • free(ptr) :
      • Ensure object valid – aligned to right address
      • Ensure allocated – bit set
      • Resets bit
    • Prevents invalid frees, double frees
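    A matching toy sketch of the deallocation checks (again illustrative, not DieHard's code): reject pointers that are not slot-aligned or whose bit is already clear, which silently masks invalid and double frees.
      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      enum { SLOTS = 1024, OBJ = 8 };

      static uint8_t bitmap[SLOTS];       /* 1 = in use */
      static char   *slab;                /* SLOTS objects of OBJ bytes */

      static void rand_free(void *p) {
          ptrdiff_t off = (char *)p - slab;
          if (off < 0 || off >= (ptrdiff_t)(SLOTS * OBJ) || off % OBJ != 0)
              return;                     /* invalid free: ignore        */
          size_t i = (size_t)off / OBJ;
          if (!bitmap[i])
              return;                     /* double free: ignore         */
          bitmap[i] = 0;                  /* genuine free: reset the bit */
      }

      int main(void) {
          slab = malloc(SLOTS * OBJ);
          bitmap[3] = 1;                  /* pretend slot 3 was allocated */
          rand_free(slab + 3 * OBJ);      /* valid: bit reset             */
          rand_free(slab + 3 * OBJ);      /* double free: ignored         */
          rand_free(slab + 5);            /* misaligned: ignored          */
          printf("slot 3 in use: %d\n", bitmap[3]);
          free(slab);
          return 0;
      }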
  • Randomized Heaps & Reliability
    • Objects randomly spread across heap
    • Different run = different heap
      • Errors across heaps independent
    [Diagram: the same objects land at different random heap locations in two runs; the overflow is “malignant” in my Mozilla (hits live data) but “benign” in yours]
  • DieHard software architecture
    • “Output equivalent” – kill failed replicas
    [Diagram: input broadcast to replicas (separate processes); execute; vote on output]
    • Each replica has a different allocator (different random seed)
    [Diagram: replica 1 / seed 1, replica 2 / seed 2, replica 3 / seed 3]
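    A heavily simplified sketch of replicated execution with voting (hypothetical; DieHard's actual replicated mode differs in many details): fork a few replicas, read one result from each over a pipe, and accept the majority answer, discarding replicas that crash or disagree.
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/wait.h>
      #include <unistd.h>

      enum { REPLICAS = 3 };

      static int compute(unsigned seed) {
          srand(seed);                    /* stands in for a differently randomized heap */
          return 42;                      /* the program's (deterministic) result        */
      }

      int main(void) {
          int results[REPLICAS];
          int fds[REPLICAS][2];

          for (int i = 0; i < REPLICAS; i++) {
              pipe(fds[i]);
              if (fork() == 0) {                        /* replica i */
                  int r = compute((unsigned)(i + 1));
                  write(fds[i][1], &r, sizeof r);
                  _exit(0);
              }
          }
          for (int i = 0; i < REPLICAS; i++) {
              if (read(fds[i][0], &results[i], sizeof results[i]) != sizeof results[i])
                  results[i] = -1;                      /* replica died before answering */
              wait(NULL);
          }
          for (int i = 0; i < REPLICAS; i++) {          /* naive majority vote */
              int votes = 0;
              for (int j = 0; j < REPLICAS; j++)
                  votes += (results[j] == results[i]);
              if (votes > REPLICAS / 2) {
                  printf("agreed output: %d\n", results[i]);
                  return 0;
              }
          }
          fprintf(stderr, "no majority\n");
          return 1;
      }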
  • DieHard Results
    • Analytical results (pictures!)
      • Buffer overflows
      • Uninitialized reads
      • Dangling pointer errors (the best)
    • Empirical results
      • Runtime overhead
      • Error avoidance
        • Injected faults & actual applications
  • Analytical Results: Buffer Overflows
    • Model overflow as a random write of live data
      • Heap half full (max occupancy)
  • Analytical Results: Buffer Overflows
    • Replicas: Increase odds of avoiding overflow in at least one replica
    • P(Overflow in all replicas) = (1/2)^3 = 1/8
    • P(No overflow in ≥ 1 replica) = 1 - (1/2)^3 = 7/8
  • Analytical Results: Buffer Overflows
    • F = free space
    • H = heap size
    • N = # objects worth of overflow
    • k = replicas
    • P(overflow avoided in at least one replica) = 1 - (1 - (F/H)^N)^k
    • Overflow of one object (N = 1) with the heap half full and k = 3: 1 - (1/2)^3 = 7/8
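    Plugging the talk's numbers into the formula above (F/H = 1/2, one overflowed object, three replicas), a quick check in C (link with -lm):
      #include <math.h>
      #include <stdio.h>

      int main(void) {
          double fh = 0.5;                /* fraction of the heap that is free */
          int N = 1, k = 3;
          double p = 1.0 - pow(1.0 - pow(fh, N), k);
          printf("P(overflow avoided in >= 1 replica) = %.3f\n", p);   /* 0.875 = 7/8 */
          return 0;
      }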
  • Empirical Results: Runtime
  • Empirical Results: Error Avoidance
    • Injected faults:
      • Dangling pointers (@50%, 10 allocations)
        • glibc: crashes; DieHard: 9/10 correct
      • Overflows (@1%, 4 bytes over)
        • glibc: crashes 9/10, infinite loop; DieHard: 10/10 correct
    • Real faults:
      • Avoids Squid web cache overflow
        • Crashes BDW & glibc
      • Avoids dangling pointer error in Mozilla
        • DoS in glibc & Windows
  • DieHard Conclusion
    • Randomization + replicas = probabilistic memory safety
      • Improves over today (0%)
      • Useful point between absolute soundness (fail-stop) and unsound
      • Future work – locate & fix errors automatically
    • Trades hardware resources (RAM, CPU) for reliability
      • Hardware trends
        • Larger memories, multi-core CPUs
      • Follows in footsteps of ECC memory, RAID
  • The End
    • http://www.cs.umass.edu/~emery/diehard
      • Linux, Solaris (stand-alone & replicated)
      • Windows (stand-alone only)
    flux: from Latin fluxus, p.p. of fluere = “to flow” – http://flux.cs.umass.edu
      • Hosted by Flux web server
      • Download via Flux BitTorrent
  • Backup
  • Handling Errors
    • What if image requested doesn’t exist?
      • Error = negative return value from component
      • Remember – nodes oblivious to Flux
    • Solution: error handlers
      • Go to alternate paths on error
      • Possible extension – can match on error paths
    handle error ReadInFromDisk → FourOhFour;
    [Diagram: an error in ReadInFromDisk redirects the flow to a FourOhFour node]
  • Flux Outline
    • Intro to Flux: building a server
      • Components
      • Flows
      • Atomicity
    • Performance results
      • Server performance
      • Performance prediction
    • Future work
  • Probabilistic Memory Safety
    • Fully-randomized memory manager
      • Increases odds of benign memory errors
      • Ensures independent heaps across users
    • Replication
      • Run multiple replicas simultaneously, vote on results
        • Detects crashing & non-crashing errors
    DieHard: correct execution in face of errors with high probability