RailswayCon 2010 - Dynamic Language VMs
Presentation Transcript

    • Dynamic Language VMs
      Ruby 1.9
      Lourens Naude, WildfireApp.com
    • Background
    • Independent Contractor
      • Ruby / C / integrations
      • Well versed full stack
      • Architecture
    • WildfireApp.com
      • Social Marketing platform
      • Large whitelabel clients
      • Bursty traffic – Lady Gaga, EA, Gatorade etc.
    • RUBY VM INTERNALS ?
    • A GOOD CRAFTSMAN KNOWS HIS TOOLS
    • A BAD CRAFTSMAN BLAMES HIS TOOLS
    • Typical public facing apps
    • Interaction patterns
      • Request / response
      • Time
      • Event driven
    • Overheads
      • Data transfer (I/O)
      • Serialization / coercion (CPU)
      • VM – allocation, symbol tables etc. (CPU + mem)
      • Business requirements (CPU)
    • Ruby daemon - strace
      Process 5856 detached
      % time   calls  syscall
      ------ ------- -------------
       89.69    5092  recvfrom
        5.35    5093  sendto
        2.49   26300  stat
        2.05   11004  clock_gettime
    • Ruby daemon - ltrace
      % time    calls  function
      ------ -------- --------
       95.78   635173  memcpy
        1.38    25862  malloc
        0.79    14984  free
        0.60    11403  strcmp
    • System Resources
    • Data latency
      • CPU cache
      • Memory – local
      • Disk - local
      • Memory + disk - remote
    • Record retrieval with ORM
      • Fetch results (local/remote memory + disk)
      • Serialization + conversion (CPU)
      • Object instantiation (CPU + memory)
      • Optional memcached (local or remote memory)
    • RUBY ?
    • Conversion – rows to hash
      Benchmark.bm do |b|
        b.report do
          1000.times { ActiveRecord::Base.connection.select_rows "SELECT * FROM users" }
        end
      end

            user     system      total        real
        0.300000   0.040000   0.340000 (  0.505095)
    • Conversion – rows to objects
      Benchmark.bm do |b|
        b.report do
          1000.times { ActiveRecord::Base.connection.select_all "SELECT * FROM users" }
        end
      end

            user     system      total        real
        0.510000   0.050000   0.560000 (  0.719201)
    • Instantiation
      Benchmark.bm do |b|
        b.report do
          100_000.times { 'string'.dup }
        end
      end

            user     system      total        real
        0.040000   0.000000   0.040000 (  0.043791)
    • Serialization – load + dump
      Benchmark.bm do |b|
        b.report do
          100_000.times { Marshal.load(Marshal.dump('ruby string')) }
        end
      end

            user     system      total        real
        1.660000   0.010000   1.670000 (  1.699882)
    • Roadmap
    • VM Architecture
      • Symbol table
      • Opcodes / instructions
      • Dispatch
      • Optimizations
    • Ruby language
      • Object model
      • Garbage Collection
      • Contexts and control flow
      • Concurrency
    • VM ARCHITECTURE
    • Changes
    • Ruby 1.8 artifacts
      • Parser && AST nodes
      • Object model
      • Garbage Collection
      • No immediate performance gains for String manipulation etc.
    • Codegen phase
      • Better optimization hooks
      • Faster runtime
    • AST AND CODEGEN
    • Abstract Syntax Tree (AST)
    • Structure
      • Grammar representation
      • Annotations attach semantics to nodes
      • Possible to refactor the tree – more nodes, less complexity
    • Example nodes
      • Literals, values and assignments
      • Method calls, arguments and return values
      • Jumps – if, else, iterators
      • Unconditional jumps – exceptions, retry etc.
    • Code generation
    • How it works
      • Converts the AST to compiled code segments
      • Reduces a tree to a linear and ordered instruction set
      • Fast execution – no tree walking + native code
    • Workflow
      • Preprocessing – AST refactoring (!YARV)
      • Codegen, nodes -> instruction sequences
      • Postprocessing – replace with optimal instruction sequences (peephole optimization)
      • Pre and postprocessing phases may be multiple passes
    • LOOKUPS
    • Symbol / Hash tables
    • How it works
      • Constant time access to int/char indexed values
      • Table defaults: 11 bins, 5 entries per bin
      • Bins++, sequential lookup inside bins
      • Lookup of methods, variables, encodings etc.
    • Symbol
      • Entity with both a String and Number representation
      • !(String || Symbol), points to a table entry
      • Developer identifies by name, VM by int
      • Immutable for performance – watch out for memory
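
The symbol behaviour above can be observed from plain Ruby. This is a small sketch against a stock MRI; the constant names are made up for illustration.

```ruby
# Two occurrences of the same symbol name resolve to one table entry,
# so they are the very same object.
SAME_SYMBOL = :railsway.equal?('railsway'.to_sym)

# Two equal strings built separately are distinct objects.
SAME_STRING = String.new('railsway').equal?(String.new('railsway'))

# A symbol carries both representations: a name and an integer identity.
SYMBOL_NAME = :railsway.to_s
SYMBOL_ID   = :railsway.object_id

puts "same symbol object? #{SAME_SYMBOL}"
puts "same string object? #{SAME_STRING}"
puts "name=#{SYMBOL_NAME.inspect} id=#{SYMBOL_ID}"
```

The integer printed for the id differs between Ruby versions and runs; only the fact that it is stable within one process matters here.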
    • VM INSTRUCTIONS
    • VM instructions / opcodes
    • Stateless functions
      • 80+ currently
      • Generated from definitions at interpreter compile time (existing ruby requirement for 1.9)
      • Instruction / opcode / operands notation
    • Categories and examples
      • variable: get or set local variable
      • class / module: definition
      • method / iterator: invoke method, call block
      • Optimization: redefines common +, <<, * contracts
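
To look at real instruction sequences, MRI 1.9 and later expose the compiler through `RubyVM::InstructionSequence`. The sketch below assumes MRI (other implementations do not ship this class), and the exact instructions printed vary between Ruby versions.

```ruby
# Compile a small snippet and dump its instruction sequence.
# RubyVM::InstructionSequence is MRI-specific.
iseq = RubyVM::InstructionSequence.compile('a = 1; a + 2')
DISASM = iseq.disasm

puts DISASM
```

On most versions the listing shows a local variable set and get, followed by a specialised instruction for `+`, which corresponds to the "Optimization" category above.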
    • Managing opcode sequences
    • Stack Machine
      • 2 basic operations: push && pop
      • Move / copy values, top of stack -> elsewhere
      • SP: top of stack pointer, BP: bottom of stack pointer
    • Example
      • %w(a b c)
      • Put strings “a”, “b” and “c” on the stack
      • Fetch top 3 stack elements
      • Create an array from them
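
The `%w(a b c)` walk-through above can be mimicked with a toy stack machine. The opcode names are borrowed for illustration only; this is a sketch, not YARV's implementation.

```ruby
# Toy stack machine: each instruction is [opcode, operand].
def run_stack_machine(instructions)
  stack = []
  instructions.each do |opcode, operand|
    case opcode
    when :putstring
      stack.push(operand)            # push a literal onto the stack
    when :newarray
      stack.push(stack.pop(operand)) # pop the top N values, push one Array
    else
      raise ArgumentError, "unknown opcode #{opcode}"
    end
  end
  stack.pop
end

W_LITERAL = [
  [:putstring, 'a'],
  [:putstring, 'b'],
  [:putstring, 'c'],
  [:newarray, 3]
].freeze

p run_stack_machine(W_LITERAL)
```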
    • Instruction sequence
    • Opcode collection
      • Instruction dispatch can be a bottleneck
      • Optimizing simple instructions is very important
      • Likely a small subset of the typical web app's hot path
    • Dispatch techniques
      • Direct Threaded Dispatch : fastest jump to next opcode / instruction
      • Switch Dispatch : slower, but portable
    • DISPATCH AND CACHE
    • Dispatch techniques
    • Direct Threaded Dispatch
      • Represents an instruction by the address of the routine that implements it
      • Forth, Python 3
      • Not portable: relies on GCC's labels-as-values extension (computed goto)
    • Switch Dispatch
      • CPU branch mispredictions, depending on pipeline length
      • Up to 50% slower than Threaded dispatch
      • Portable
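
Ruby itself cannot express computed gotos, so the following is only an analogy for the two techniques. `run_switch` decodes every opcode through one central branch, while `thread_program` replaces each opcode once with the routine that implements it. All names here are hypothetical.

```ruby
# Routines that implement each instruction of a tiny arithmetic machine.
HANDLERS = {
  push: ->(stack, arg) { stack.push(arg) },
  add:  ->(stack, _arg) { stack.push(stack.pop + stack.pop) },
  mul:  ->(stack, _arg) { stack.push(stack.pop * stack.pop) }
}.freeze

# Switch-style dispatch: every step goes through the same case statement.
def run_switch(program)
  stack = []
  program.each do |op, arg|
    case op
    when :push then stack.push(arg)
    when :add  then stack.push(stack.pop + stack.pop)
    when :mul  then stack.push(stack.pop * stack.pop)
    else raise ArgumentError, "unknown opcode #{op}"
    end
  end
  stack.pop
end

# Threaded-style dispatch: resolve opcodes to their routines ahead of time.
def thread_program(program)
  program.map { |op, arg| [HANDLERS.fetch(op), arg] }
end

# The run loop no longer decodes anything; it just calls the next routine.
def run_threaded(threaded)
  stack = []
  threaded.each { |handler, arg| handler.call(stack, arg) }
  stack.pop
end

# (2 + 3) * 4
PROGRAM = [[:push, 2], [:push, 3], [:add, nil], [:push, 4], [:mul, nil]].freeze

p run_switch(PROGRAM)
p run_threaded(thread_program(PROGRAM))
```

In C the threaded variant avoids the shared branch entirely, which is where the speed difference quoted above comes from; this Ruby version only shows the shape of the idea.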
    • VM Caches
    • Versioning
      • State counter scopes caches to the current VM state
      • Lazy invalidation – just bump the version
    • Expires on
      • constant definition
      • constant removal
      • method definition
      • method removal
      • method cache changes (covered later)
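
The versioning scheme above can be sketched as a cache whose entries are stamped with the state counter at the time they were stored. `VersionedCache` is a hypothetical class for illustration, not the VM's data structure.

```ruby
# Entries are only valid for the version they were stored under.
class VersionedCache
  attr_reader :version

  def initialize
    @version = 0
    @entries = {}
  end

  # Lazy invalidation: nothing is deleted, stale entries just stop matching.
  def bump!
    @version += 1
  end

  def fetch(key)
    stamp, value = @entries[key]
    return value if stamp == @version # hit: stored under the current state

    value = yield                     # miss or stale: recompute
    @entries[key] = [@version, value]
    value
  end
end

cache = VersionedCache.new
cache.fetch(:answer) { 42 }  # computed
cache.fetch(:answer) { 0 }   # served from the cache
cache.bump!                  # e.g. a constant or method was (re)defined
cache.fetch(:answer) { 43 }  # recomputed
```

Bumping the counter is constant time regardless of how many entries exist, which is the appeal of this approach.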
    • OPTIMIZATIONS
    • Optimization Limitations
    • Static Analysis
      • Examine source code without execution
      • Dynamic analysis – runtime introspection
    • Dynamic nature of Ruby
      • Literals are generally safe to consider for optimizations
      • Constants can be redefined
      • Open classes – variable method table
      • Object#method_missing
      • No explicit return types
    • Common optimizations
    • Constant folding
    • Constant propagation
    • Dead code elimination
    • Subexpression elimination
    • Method in-lining
    • Cloning
    • Peephole Optimization
    • * not all implemented in YARV
    • Constant folding
      1 + 2  # 3
      2 * 3  # 3 + 3
      2 * 1  # 2
      2 ** 2 # 2 * 2

      class Fixnum
        def +(*args) # dynamic Ruby spec
        end
      end
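
As a sketch of what a folding pass does, the function below reduces integer-only subtrees of a tiny array-based AST. The AST shape is an assumption made for this example, and the pass assumes nobody has redefined the integer operators, which is exactly the guarantee Ruby does not give.

```ruby
FOLDABLE = [:+, :-, :*, :**].freeze

# A node is either a leaf (Integer, or a Symbol naming a variable)
# or an Array of the form [operator, left, right].
def fold(node)
  return node unless node.is_a?(Array)

  op, left, right = node
  left  = fold(left)
  right = fold(right)

  if FOLDABLE.include?(op) && left.is_a?(Integer) && right.is_a?(Integer)
    left.public_send(op, right) # both operands known: compute now
  else
    [op, left, right]           # keep the node, with folded children
  end
end

p fold([:+, 1, [:*, 2, 3]])
p fold([:+, :x, [:*, 2, 3]])
```

The second call keeps the outer `+` because `:x` is only known at runtime, while the inner `2 * 3` still folds.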
    • Code elimination
      # before
      loop {
        begin
          # eval'ed code
          break
          break
        ensure
        end
      }

      # after
      loop {
        begin
          # eval'ed code
          break
        ensure
        end
      }
    • Subexpression elimination
      # before
      x = x - (y * 2)
      z = z - (y * 2)

      # after
      t = y * 2
      x = x - t
      z = z - t
    • Constant propagation
      # before
      def a
        b = 20
        c(3 * b)
      end

      # after propagation
      def a
        b = 20
        c(3 * 20)
      end

      # after folding
      def a
        c(60)
      end
    • In-lining
      def b
        2 * 3
      end

      # before
      def a
        2 + b
      end

      # naive textual substitution
      # def a
      #   2 + 2 * 3
      # end

      # after in-lining
      def a
        2 + (2 * 3)
      end
    • Cloning
      def a(b, c)
        b << c
        expire_cache
      end

      a('railsway', 'con')

      def a_railsway_con
        'railsway' << 'con'
        expire_cache
      end
    • Peephole Optimization (before)
      x = true    # 0008 getlocal x
      if x        # 0010 branchunless 17
      else        # 0012 jump 14
      end         # 0014 putnil
                  # 0015 jump 18
                  # 0017 putnil
                  # 0018 leave
    • Peephole Optimization (after)
      x = true    # 0008 getlocal x
      if x        # 0010 branchunless 15
      else        # 0012 putnil
      end         # 0013 leave
                  # 0014 pop
                  # 0015 putnil
                  # 0016 leave
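
One of the rewrites visible in the before/after listings is dropping a jump whose target is the very next instruction. The sketch below implements only that single rule over a made-up instruction format with symbolic labels; it is not YARV's peephole pass.

```ruby
# Instructions are [opcode, operand]; [:label, name] marks a jump target.
# Removes every jump that targets the label immediately following it.
def remove_jumps_to_next(instructions)
  instructions.each_with_index.reject do |(opcode, operand), index|
    opcode == :jump && instructions[index + 1] == [:label, operand]
  end.map(&:first)
end

BEFORE_PEEPHOLE = [
  [:getlocal, :x],
  [:branchunless, :else_branch],
  [:jump, :then_branch],   # jumps to the next instruction: redundant
  [:label, :then_branch],
  [:putnil, nil],
  [:jump, :done],          # skips the else branch: must stay
  [:label, :else_branch],
  [:putnil, nil],
  [:label, :done],
  [:leave, nil]
].freeze

remove_jumps_to_next(BEFORE_PEEPHOLE).each { |insn| p insn }
```

Using labels instead of numeric addresses keeps the example short, since removing an instruction then needs no address renumbering.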
    • OBJECTS
    • Object Requirements
    • Stateful
    • Identity
      • Unique identifier to represent the object at runtime
    • Methods
      • Change or query object state
      • Command and Query pattern
    • Object structure
      typedef unsigned long VALUE;

      struct RBasic {
          VALUE flags;  /* object flags */
          VALUE klass;  /* instance of ... */
      };
    • Object structure (cont)
    • Casting
      • VALUE is a pointer-sized type that holds the addresses of language structures
      • RBASIC(obj)->flags
      • ((struct RBasic *)obj)->flags
    • Flags
      • frozen
      • marked
      • tainted
      • embedded status
    • Classes / modules structure
      struct RClass {
          struct RBasic basic;            /* object structure */
          rb_classext_t *ptr;             /* external class */
          struct st_table *m_tbl;         /* method table */
          struct st_table *iv_index_tbl;  /* ivars */
      };
    • Class / module structure (cont)
    • Casting
      • RCLASS(a_str)->ptr->super #=> Object
      • RCLASS(a_fixnum)->ptr->super #=> Integer
    • Attributes
      • Symbol tables for methods and ivars
      • Class / module distinction through flags
    • Special objects
    • Immediates
      • No runtime casting overheads – fits in VALUE
      • nil #=> 4
      • true #=> 2
      • false #=> 0
      • Symbols
      • Fixnums <= 30 bits
      • Floats and Bignum are complex objects – hence poor Floating Point benchmarks
      • RFLOAT(float_obj)->float_value #=> a double
    • Object memory layout
    • Object#object_id (32 bit architecture)
      • sizeof(VALUE) is 4 bytes
      • Objects, even, multiples of 4
      • Symbols, even, multiples of 8
      • Integers, odd
      • Immediates <= 4
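
The "Integers, odd" rule can be checked from Ruby: on MRI a small integer `n` is stored directly in the VALUE as `2n + 1`, and `object_id` reflects that. This sketch assumes MRI; the ids of other immediates vary with Ruby version and architecture, so only integers are shown.

```ruby
# Expected object_id for a small integer under MRI's tagging scheme.
def tagged_integer_id(n)
  2 * n + 1
end

[0, 1, 2, 100].each do |n|
  puts "#{n}.object_id = #{n.object_id} (2n + 1 = #{tagged_integer_id(n)})"
end
```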
    • Mutable Objects
      struct RString {
          struct RBasic basic;
          union {
              struct {
                  long len;
                  char *ptr;
                  union {
                      long capa;
                      VALUE shared;
                  } aux;
              } heap;
              ...
    • Mutable Objects (cont)
    • String and Array
      • require the ability to shrink / grow capacity
      • allocates slightly more data than required
      • Avoids malloc, realloc and memmove overhead
      • Short strings “str”
      • Short arrays %w(a r y)
    • Shared Objects
      str = 'railsway'
      str2 = "#{str}con"        # shared ref
      str3 = str << 'con'       # copy + mod
      ary = %w(railsway con)
      ary2 = ary.dup            # shared ref
      ary3 = ary2.delete_at(1)  # copy + mod
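
Whatever sharing the VM does internally, it must stay invisible to the program: modifying a duplicate never changes the original. A minimal sketch of that observable contract follows; `dup_then_append` is a hypothetical helper, and whether MRI actually shares the buffer for a given string depends on its length and the Ruby version.

```ruby
# Duplicate a string, modify the duplicate, and return both.
def dup_then_append(source, suffix)
  copy = source.dup # the VM may share the underlying buffer at this point
  copy << suffix    # a write gives the copy its own storage if it was shared
  [source, copy]
end

original, modified = dup_then_append(String.new('railsway'), 'con')
puts original
puts modified
```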
    • Method Dispatch
    • Language constraints
      • Loose typing
      • Open classes
      • Method calls can never be reduced to CALL(a_method)
      • Search overhead
    • Language constraints
    • Dispatch sequence
    • Deref class pointer
    • Check methods table
    • Call method or delegate to superclass
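
The dispatch sequence above can be approximated in Ruby by walking a class's ancestor chain and checking each method table in turn. `find_method_owner` and the example classes are hypothetical; the VM does this in C and consults its method cache first.

```ruby
# Return the first class or module in the ancestor chain that defines `name`.
def find_method_owner(klass, name)
  klass.ancestors.find do |mod|
    mod.instance_methods(false).include?(name) ||
      mod.private_instance_methods(false).include?(name)
  end
end

module Greeting
  def greet
    'hello'
  end
end

class Animal
  include Greeting

  def title
    'animal'
  end
end

class Dog < Animal
end

p find_method_owner(Dog, :greet)
p find_method_owner(Dog, :title)
p Dog.instance_method(:greet).owner
```

The last line asks Ruby itself for the owner, which should agree with the manual walk.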
    • call VS send
    • obj.__send__ :method
      • We never call methods
      • Send query / command messages to objects
      • Methods return values – RPC style messaging
    • Method cache
      • Method cache == router
      • 95% hit rate when warm
      • Method redefinition, module inclusion etc. clears the method cache / “routing table”
      • Introduces significant overhead for subsequent method calls
    • Method cache don'ts
      class SomeController < AC::Base
        def show
          # busts method cache for the whole VM
          @user.extend SomeBehavior
        end
      end
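
One way to get per-request behaviour without calling `extend` at runtime is to wrap the object in a decorator class that is defined once at load time. This is a sketch of that alternative, not something the slides prescribe; `User`, `UserPresenter` and `display_name` are hypothetical names.

```ruby
require 'delegate'

User = Struct.new(:first_name, :last_name)

# Defined once when the file loads, so serving a request does not
# change any class or singleton class.
class UserPresenter < SimpleDelegator
  def display_name
    "#{first_name} #{last_name}"
  end
end

def show(user)
  UserPresenter.new(user).display_name
end

puts show(User.new('Lady', 'Gaga'))
```

The wrapper costs one extra allocation per request, which is usually a better trade than invalidating caches shared by the whole process.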
    • Instance var changes
    • Optimizations
      • First 3 ivars are embedded on the object
      • Avoids symbol table lookups
    • ivar table
      • Table is per class, not per object
      • Ivar table is shared by all instances of the same class
      • Saves on memory footprint of a table per instance
    • GARBAGE COLLECTION
    • Process memory layout
    • Code segment
      • Executable code
      • Read only
    • Stack segment
      • Stack storage
      • Addressed with stack pointers
    • Heap
      • Memory available for program / developer use
    • Malloc
    • Usable / free space
      • Managed by a free list
      • Linear search overhead to find free chunks
    • Better layout
      • Index free chunks by size intervals
    • GC terminology
    • Root set
      • Directly accessible without pointer scanning
      • C stack, global vars, global constants etc.
    • Unreachable hooks
      • Variable assignment to nil
      • method return etc.
    • Conservative
      • VM hands out raw pointers to objects
    • GC strategies
    • Stop the World
      • Minimal allocation overhead
      • Hands out objects while heap space is available
      • Halts execution to reclaim memory
      • Very disruptive in the hot path
    • Incremental
      • Collection activity during allocation
      • Smoother, but with some minor overhead
      • Suitable for hard realtime environments
    • Scripting GC
    • Mark and Sweep
      • Identifies live objects
      • Assumes remainder is for collection
      • Concerned with unreachable objects
    • Stop and Copy
      • 2 heap spaces (double memory overhead)
      • 1 active, 1 inactive
      • Copies reachable chunks to the new active area
      • Concerned with live objects
    • Common GC Issues
    • Conservative GC
      • Memory fragmentation
      • Dangling pointers
      • Memory leaks from circular garbage
    • Allocation
      • Bursty allocation
      • Knowledge of pointer layout and chunks required
    • Ruby heap layout
    • Multiple heaps
      • Referenced through heap list
      • Composed of multiple slots
      • Freed when empty, i.e. only if all slots are tagged as free
      • A Rails app allocates 4 to 6 heaps on startup
    • Slot layouts
    • Per heap
      • Each slot references a single object
      • Defaults to 10 000 slots for the first heap
      • Threshold of 4096 free slots per heap
      • Free list points to the next free slot
    • Heap growth
      • Next allocated heap has 1.8x the capacity of the last one
      • That's why memory consumption's so high ...
    • Heap growth – small app
      >> 8 * 1.8
      => 14.4
      >> 8 * 1.8 * 1.8
      => 25.92
      >> 8 * 1.8 * 1.8 * 1.8
      => 46.656
      >> 8 * 1.8 * 1.8 * 1.8 * 1.8
      => 83.9808
    • Heap growth – mid to large app
      => 83.9808
      >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
      => 151.16544
      >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
      => 272.097792
      >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
      => 489.7760256
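
The irb arithmetic above is a geometric series, which can be generated in one go. `heap_sizes` is a hypothetical helper; the starting value of 8 and the factor of 1.8 are taken from the slides, and results are rounded to four decimals to hide floating point noise.

```ruby
GROWTH_FACTOR = 1.8

# Size of each successive heap, starting from `initial`.
def heap_sizes(initial, count)
  (0...count).map { |n| (initial * GROWTH_FACTOR**n).round(4) }
end

p heap_sizes(8, 5)
p heap_sizes(8, 8).last
```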
    • Slot structure
      typedef struct RVALUE {
          union {
              struct {
                  VALUE flags;  /* 0 when free */
                  struct RVALUE *next;
              } free;
              struct RObject object;
              struct RFloat flonum;
              ...
    • Pointer layout
    • Self describing
      • Program data area and heap
      • RVALUE union can accommodate any ruby object
      • Frames, variable structures etc. well defined also
      • 40 bytes (64 bit arch) represents a slot
      • Free list points to the next free slot
    • Ruby heap VS OS heap
    • Ruby heap
      • 20 bytes (32 bit arch) represents a slot
      • slot points to OS data, on the OS / system heap
    • OS heap
      • Thus a 20 byte slot can reference a 2MB chunk on the system heap
    • CRuby: Mark and Sweep
    • Conservative
      • Cannot determine with certainty if a value references an object – assume it's in use
    • Two phase implementation
      • Mark phase: identifies and flags reachable objects from the current program context
      • Sweep phase: iterates through the object space and …
      • free all objects not marked
      • unmark marked objects
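
The two phases can be sketched over a toy heap of slots. `Slot`, `mark`, `sweep` and `collect` are hypothetical names; the real collector works on RVALUE slots in C and finds roots by scanning the stack conservatively.

```ruby
# A toy heap slot: a name, outgoing references and a mark bit.
class Slot
  attr_reader :name, :references
  attr_accessor :marked

  def initialize(name)
    @name = name
    @references = []
    @marked = false
  end
end

# Mark phase: recursively flag everything reachable from a slot.
def mark(slot)
  return if slot.marked

  slot.marked = true
  slot.references.each { |child| mark(child) }
end

# Sweep phase: split the heap into live and garbage, then unmark survivors.
def sweep(heap)
  live, garbage = heap.partition(&:marked)
  live.each { |slot| slot.marked = false }
  [live, garbage]
end

def collect(heap, roots)
  roots.each { |root| mark(root) }
  sweep(heap)
end

a, b, c, d = %w(a b c d).map { |name| Slot.new(name) }
a.references << b   # a -> b, reachable from the root
c.references << d   # c <-> d form a cycle nobody points to
d.references << c

live, garbage = collect([a, b, c, d], [a])
puts "live: #{live.map(&:name).join(', ')}"
puts "garbage: #{garbage.map(&:name).join(', ')}"
```

The unreachable cycle is collected because marking starts from the roots rather than counting references. The recursive `mark` also shows why deep object graphs put pressure on the stack.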
    • Concerns
    • Performance
      • Runtime pauses
      • Work proportional to heap size
      • Prone to memory fragmentation (no compaction)
      • Recursive
    • Triggers
      • 8m malloc calls trigger GC
      • Every 8MB allocated triggers GC
      • Not enough heap reserve
    • GC in action
      # 4 objs, 1 Array, 3 Strings
      ary1 = %w(a b c)
      ary2 = %w(d e f)
      # both ary1 and ary2 are reachable
      ary1 = nil
      # ary1 and its contents are unreachable
    • Generational GC
    • Observations
      • Vast majority of objects are short lived – 80%+
      • Expensive to account for long lived objects
      • Partition by age and frequently collect short lived ones
    • How it works
      • Restrict GC to the most recently modified slots
      • These “sub heaps” are referred to as generations
      • Perform a full GC only when the youngest generation fails to meet memory requirements
    • CONCURRENCY
    • Threading
    • Changes
      • Native OS Threads
      • Ruby Thread == pthread
      • Multiple cores ftw!
    • … but
      • Syscalls schedule, synchronize and create
      • Much more expensive to spawn and switch than green threads
      • Global VM Lock (GVL)
    • Global VM Lock (GVL)
    • How it works
      • Thread that owns the GVL is allowed to execute
      • Blocking operations should release the GVL
      • Automatically released when scheduled
      • C extensions: authors need not concern themselves with synchronization
    • Blocking VM operations
    • I/O
      • blocking reads and writes
      • DNS resolution or connects
      • Often have huge handshake overheads
    • Computations, processes and locks
      • Expensive Bignum ops blocked 1.8 interpreters
      • Process.waitpid
      • File locks
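
Because blocking operations release the GVL, threads that spend their time waiting do overlap. The sketch below uses `sleep` as a stand-in for a blocking read or connect; `concurrent_sleep` is a hypothetical helper and the timings are indicative only.

```ruby
# Run `threads` threads that each block for `seconds`, return elapsed time.
def concurrent_sleep(threads, seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Array.new(threads) { Thread.new { sleep(seconds) } }.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

elapsed = concurrent_sleep(4, 0.2)
puts format('4 threads blocking for 0.2s each took %.2fs in total', elapsed)
```

If the waits were serialised by the lock, the total would be close to 0.8 seconds rather than close to 0.2. CPU-bound Ruby code would not show the same overlap.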
    • Releasing the GVL
    • Stable API
      • Blocking function: slow system call / computation
      • Unblock function: called on Thread interrupt
    • Pitfalls
      • Cannot access VALUEs (objects) in blocking functions
      • No integration with Ruby's exception / error handler
    • Lightweight Concurrency
    • Fibers
      • Coroutines – 4k stack size
      • Very fast user space context switches
      • Cooperative scheduling required
      • Fiber.yield pauses the activation record, which keeps context across multiple calls
    • Use cases
      • Generators
      • Blocking I/O - NeverBlock
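
The generator use case can be sketched with a Fiber that yields one Fibonacci number per `resume` and keeps its local state in between. `fibonacci_generator` and `take_from` are hypothetical helper names.

```ruby
# Each resume runs the fiber until the next Fiber.yield.
def fibonacci_generator
  Fiber.new do
    a, b = 0, 1
    loop do
      Fiber.yield a     # pause here and hand `a` to the caller
      a, b = b, a + b   # state survives until the next resume
    end
  end
end

# Pull `count` values out of a generator fiber.
def take_from(fiber, count)
  Array.new(count) { fiber.resume }
end

p take_from(fibonacci_generator, 8)
```

Scheduling is entirely cooperative: nothing runs inside the fiber unless the caller resumes it.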
    • In the pipeline
    • MVM: Multiple Virtual Machines
      • Shared process state
      • Sandboxed per VM application state
      • Distribute VMs across available cores
      • Message passing for inter VM communication
      • Most Ruby deployments aren't thread safe
      • MVM is well suited for this
    • QUESTIONS ?