RailswayCon 2010 - Dynamic Language VMs
 


      • Dynamic Language VMs
        Ruby 1.9
        Lourens Naude, WildfireApp.com
      • Background
      • Independent Contractor
        • Ruby / C / integrations
        • Well versed across the full stack
        • Architecture
      • WildfireApp.com
        • Social Marketing platform
        • Large whitelabel clients
        • Bursty traffic – Lady Gaga, EA, Gatorade etc.
      • RUBY VM INTERNALS ?
      • A GOOD CRAFTSMAN KNOWS HIS TOOLS
      • A BAD CRAFTSMAN BLAMES HIS TOOLS
      • Typical public facing apps
      • Interaction patterns
        • Request / response
        • Time
        • Event driven
      • Overheads
        • Data transfer (I/O)
        • Serialization / coercion (CPU)
        • VM – allocation, symbol tables etc. (CPU + mem)
        • Business requirements (CPU)
      • Ruby daemon - strace
      Process 5856 detached
      % time     calls  syscall
      ------   -------  -------------
       89.69      5092  recvfrom
        5.35      5093  sendto
        2.49     26300  stat
        2.05     11004  clock_gettime
      • Ruby daemon - ltrace
      % time     calls  function
      ------  --------  --------
       95.78    635173  memcpy
        1.38     25862  malloc
        0.79     14984  free
        0.60     11403  strcmp
      • System Resources
      • Data latency
        • CPU cache
        • Memory – local
        • Disk - local
        • Memory + disk - remote
      • Record retrieval with ORM
        • Fetch results (local/remote memory + disk)
        • Serialization + conversion (CPU)
        • Object instantiation (CPU + memory)
        • Optional memcached (local or remote memory)
      • RUBY ?
      • Conversion – rows to arrays
      require 'benchmark'
      Benchmark.bm do |b|
        b.report do
          1000.times { ActiveRecord::Base.connection.select_rows "SELECT * FROM users" }
        end
      end
            user     system      total        real
        0.300000   0.040000   0.340000 (  0.505095)
      • Conversion – rows to hashes
      require 'benchmark'
      Benchmark.bm do |b|
        b.report do
          1000.times { ActiveRecord::Base.connection.select_all "SELECT * FROM users" }
        end
      end
            user     system      total        real
        0.510000   0.050000   0.560000 (  0.719201)
      • Instantiation
      require 'benchmark'
      Benchmark.bm do |b|
        b.report do
          100_000.times { 'string'.dup }
        end
      end
            user     system      total        real
        0.040000   0.000000   0.040000 (  0.043791)
      • Serialization – load + dump
      require 'benchmark'
      Benchmark.bm do |b|
        b.report do
          100_000.times { Marshal.load(Marshal.dump('ruby string')) }
        end
      end
            user     system      total        real
        1.660000   0.010000   1.670000 (  1.699882)
      • Roadmap
      • VM Architecture
        • Symbol table
        • Opcodes / instructions
        • Dispatch
        • Optimizations
      • Ruby language
        • Object model
        • Garbage Collection
        • Contexts and control flow
        • Concurrency
      • VM ARCHITECTURE
      • Changes
      • Ruby 1.8 artifacts
        • Parser && AST nodes
        • Object model
        • Garbage Collection
        • No immediate performance gains for String manipulation etc.
      • Codegen phase
        • Better optimization hooks
        • Faster runtime
      • AST AND CODEGEN
      • Abstract Syntax Tree (AST)
      • Structure
        • Grammar representation
        • Annotations attach semantics to nodes
        • Possible to refactor the tree – more nodes, less complexity
      • Example nodes
        • Literals, values and assignments
        • Method calls, arguments and return values
        • Conditional jumps – if, else, iterators
        • Unconditional jumps – exceptions, retry etc.
      • Code generation
      • How it works
        • Converts the AST to compiled code segments
        • Reduces a tree to a linear and ordered instruction set
        • Fast execution – no tree walking + native code
      • Workflow
        • Preprocessing – AST refactoring (not done by YARV)
        • Codegen, nodes -> instruction sequences
        • Postprocessing – replace with optimal instruction sequences (peephole optimization)
        • Pre and postprocessing phases may be multiple passes
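The codegen workflow above can be sketched in Ruby: one recursive pass flattens a tree into a linear, ordered instruction list, and a simple loop then executes that list without any tree walking. The nested-array node shapes and the `compile` / `run` helpers are hypothetical; only the instruction names `putobject` and `send` are borrowed from YARV.

```ruby
# Hypothetical AST: [:lit, value] or [:call, method, receiver, *args].
# Codegen walks the tree once and emits a flat, ordered instruction list.
def compile(node, out = [])
  case node[0]
  when :lit
    out << [:putobject, node[1]]
  when :call
    _, meth, recv, *args = node
    compile(recv, out)                      # receiver first
    args.each { |arg| compile(arg, out) }   # then arguments, left to right
    out << [:send, meth, args.size]
  else
    raise ArgumentError, "unknown node #{node[0].inspect}"
  end
  out
end

# Executing the linear sequence needs only a loop and a value stack.
def run(insns)
  stack = []
  insns.each do |op, *operands|
    case op
    when :putobject
      stack.push(operands[0])
    when :send
      meth, argc = operands
      args = stack.pop(argc)                # top argc values, in order
      recv = stack.pop
      stack.push(recv.public_send(meth, *args))
    end
  end
  stack.pop
end

ast   = [:call, :*, [:call, :+, [:lit, 1], [:lit, 2]], [:lit, 4]]   # (1 + 2) * 4
insns = compile(ast)
# => [[:putobject, 1], [:putobject, 2], [:send, :+, 1], [:putobject, 4], [:send, :*, 1]]
result = run(insns) # => 12
```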
      • LOOKUPS
      • Symbol / Hash tables
      • How it works
        • Constant time access to values keyed by int or char*
        • Table defaults: 11 bins, at most 5 entries per bin on average
        • More bins when that density is exceeded (rehash), sequential lookup inside a bin
        • Lookup of methods, variables, encodings etc.
      • Symbol
        • Entity with both a String and Number representation
        • Not a String: a Symbol points to a symbol table entry
        • Developer identifies by name, VM by int
        • Immutable for performance – symbols are never garbage collected in 1.9, so watch out for memory
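The table behaviour described above can be modelled in a few lines of Ruby. This is a hypothetical sketch, not MRI's st.c: `ChainedTable`, `SymbolTable` and the `bins * 2 + 1` growth rule are assumptions (st.c has its own sizing rule). It keeps the 11 initial bins and the density limit of 5, and adds a name-to-int interning table on top to mirror "developer identifies by name, VM by int".

```ruby
# Chained hash table: 11 bins, rehash once the average chain length exceeds 5.
class ChainedTable
  INITIAL_BINS = 11
  MAX_DENSITY  = 5

  attr_reader :size

  def initialize
    @bins = Array.new(INITIAL_BINS) { [] }
    @size = 0
  end

  def bin_count
    @bins.size
  end

  def []=(key, value)
    chain = @bins[key.hash % @bins.size]
    entry = chain.find { |k, _| k.eql?(key) }   # sequential lookup inside the bin
    if entry
      entry[1] = value
    else
      chain << [key, value]
      @size += 1
      rehash if @size > MAX_DENSITY * @bins.size
    end
  end

  def [](key)
    pair = @bins[key.hash % @bins.size].find { |k, _| k.eql?(key) }
    pair && pair[1]
  end

  private

  # Assumed growth rule for illustration: bins * 2 + 1.
  def rehash
    pairs = @bins.flatten(1)
    @bins = Array.new(@bins.size * 2 + 1) { [] }
    pairs.each { |k, v| @bins[k.hash % @bins.size] << [k, v] }
  end
end

# Symbol-style interning: name -> int id and id -> name.
class SymbolTable
  def initialize
    @ids   = ChainedTable.new   # name -> id
    @names = []                 # id   -> name
  end

  def intern(name)
    @ids[name] || begin
      id = @names.size
      @names << name.dup.freeze
      @ids[name] = id
      id
    end
  end

  def name_of(id)
    @names[id]
  end
end
```

Interning the same name twice returns the same int, which is why comparing symbols is a cheap integer comparison for the VM.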
      • VM INSTRUCTIONS
      • VM instructions / opcodes
      • Stateless functions
        • 80+ currently
        • Generated from instruction definitions (insns.def) when the interpreter is built, hence the existing Ruby requirement when building 1.9
        • Instruction / opcode / operands notation
      • Categories and examples
        • variable: get or set local variable
        • class / module: definition
        • method / iterator: invoke method, call block
        • Optimization: specialized instructions (opt_plus, opt_ltlt, opt_mult) for the common +, << and * calls
      • Managing opcode sequences
      • Stack Machine
        • 2 instruction types: push && pop
        • Move / copy values, top of stack -> elsewhere
        • SP: top of stack pointer, BP: bottom of stack pointer
      • Example
        • %w(a b c)
        • Put strings “a”, “b” and “c” on the stack
        • Fetch top 3 stack elements
        • Create an array from them
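The %w(a b c) steps above can be sketched as a toy stack machine. `ToyStackVM` and its two opcodes are hypothetical, modelled on YARV's `putstring` and `newarray`; SP is taken as the index one past the top of the stack, with BP fixed at 0.

```ruby
class ToyStackVM
  attr_reader :stack

  def initialize
    @stack = []
  end

  # Stack pointer: one past the top element (BP is index 0).
  def sp
    @stack.size
  end

  def execute(insns)
    insns.each do |op, operand|
      case op
      when :putstring
        @stack.push(operand.dup)        # push a fresh string
      when :newarray
        elems = @stack.pop(operand)     # fetch the top N elements, in order
        @stack.push(elems)              # push the new array
      else
        raise ArgumentError, "unknown opcode #{op}"
      end
    end
    @stack.last
  end
end

# %w(a b c) as a linear instruction sequence
vm = ToyStackVM.new
result = vm.execute([[:putstring, "a"], [:putstring, "b"], [:putstring, "c"], [:newarray, 3]])
```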
      • Instruction sequence
      • Opcode collection
        • Instruction dispatch can be a bottleneck
        • Optimizing simple instructions is very important
        • A typical web app's hot path likely exercises only a small subset of them
      • Dispatch techniques
        • Direct Threaded Dispatch : fastest jump to next opcode / instruction
        • Switch Dispatch : slower, but portable
      • DISPATCH AND CACHE
      • Dispatch techniques
      • Direct Threaded Dispatch
        • Represents an instruction by the address of the routine that implements it
        • Forth, Python 3
        • Not portable: relies on GCC's first class labels (labels as values)
      • Switch Dispatch
        • CPU branch mispredictions, depending on pipeline length
        • Up to 50% slower than Threaded dispatch
        • Portable
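The difference between the two techniques can be approximated in Ruby, with one loud caveat: Ruby has no computed goto, so the "threaded" variant below only mimics the idea by translating each opcode once into a reference to the routine that implements it. The opcodes, `ROUTINES` and the helper names are hypothetical.

```ruby
# Switch dispatch: decode the opcode on every step via a case statement.
def run_switch(program)
  stack = []
  pc = 0
  while pc < program.size
    op, arg = program[pc]
    case op
    when :push then stack.push(arg)
    when :add  then stack.push(stack.pop + stack.pop)
    when :mul  then stack.push(stack.pop * stack.pop)
    else raise ArgumentError, "bad opcode #{op}"
    end
    pc += 1
  end
  stack.pop
end

# Threaded-dispatch analogy: each instruction becomes the routine itself,
# so the run loop calls routines directly and never decodes opcodes.
ROUTINES = {
  push: ->(stack, arg) { stack.push(arg) },
  add:  ->(stack, _)   { stack.push(stack.pop + stack.pop) },
  mul:  ->(stack, _)   { stack.push(stack.pop * stack.pop) }
}.freeze

def thread_code(program)
  program.map { |op, arg| [ROUTINES.fetch(op), arg] }   # KeyError on unknown ops
end

def run_threaded(threaded)
  stack = []
  threaded.each { |routine, arg| routine.call(stack, arg) }
  stack.pop
end

program = [[:push, 2], [:push, 3], [:add], [:push, 4], [:mul]]   # (2 + 3) * 4
switch_result   = run_switch(program)                 # => 20
threaded_result = run_threaded(thread_code(program))  # => 20
```

In C the threaded form stores label addresses and ends each routine with `goto *next`, avoiding the central switch and its poorly predicted indirect branch.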
      • VM Caches
      • Versioning
        • State counter scopes caches to the current VM state
        • Lazy invalidation – just bump the version
      • Expires on
        • constant definition
        • constant removal
        • method definition
        • method removal
        • method cache changes (covered later)
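Versioned caching with lazy invalidation can be sketched as follows. `VMState` and `InlineConstCache` are hypothetical stand-ins, not MRI's data structures: a single global counter is bumped on every constant definition or removal, and a call-site cache simply refills itself whenever its stored version no longer matches.

```ruby
class VMState
  attr_reader :version

  def initialize
    @version = 0
    @constants = {}
  end

  def define_constant(name, value)
    @constants[name] = value
    @version += 1                  # lazy invalidation: just bump the version
  end

  def remove_constant(name)
    @constants.delete(name)
    @version += 1
  end

  def slow_lookup(name)
    @constants.fetch(name)
  end
end

# One cache per call site, stamped with the version it was filled at.
class InlineConstCache
  attr_reader :misses

  def initialize(state, name)
    @state = state
    @name = name
    @cached_version = nil
    @value = nil
    @misses = 0
  end

  def get
    unless @cached_version == @state.version
      @misses += 1
      @value = @state.slow_lookup(@name)
      @cached_version = @state.version
    end
    @value
  end
end

vm = VMState.new
vm.define_constant(:LIMIT, 10)
site = InlineConstCache.new(vm, :LIMIT)
site.get                        # => 10 (miss, fills the cache)
site.get                        # => 10 (hit)
vm.define_constant(:OTHER, 1)   # any definition bumps the version
site.get                        # => 10 (miss again: coarse, lazy invalidation)
```

The trade-off is coarse granularity: invalidation costs one increment, but unrelated definitions also force misses.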
      • OPTIMIZATIONS
      • Optimization Limitations
      • Static Analysis
        • Examine source code without execution
        • Dynamic analysis – runtime introspection
      Dynamic nature of Ruby
        • Literals are generally safe to consider for optimizations
        • Constants can be redefined
        • Open classes – method tables can change at runtime
        • Object#method_missing
        • No explicit return types
      • Common optimizations
      • Constant folding
      • Constant propagation
      • Dead code elimination
      • Subexpression elimination
      • Method in-lining
      • Cloning
      • Peephole Optimization
      • * not all implemented in YARV
      • Constant folding
        1 + 2   # 3
        2 * 3   # 3 + 3
        2 * 1   # 2
        2 ** 2  # 2 * 2
        # ... but only while the operators have not been redefined:
        class Fixnum
          def +(*args) # dynamic Ruby spec
          end
        end
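Why folding needs a guard in Ruby can be sketched with a tiny folder. The nested-array node shape, `fold`, `FOLDABLE` and the `redefined:` parameter are hypothetical; the parameter stands in for whatever redefinition tracking a real VM would use. Integer `+`, `-` and `*` on two literals are folded only when that operator is not listed as redefined.

```ruby
FOLDABLE = %i[+ - *].freeze

# Nodes: [:lit, value] or [:call, op, lhs, rhs]
def fold(node, redefined: [])
  return node unless node[0] == :call
  _, op, lhs, rhs = node
  lhs = fold(lhs, redefined: redefined)     # fold children first
  rhs = fold(rhs, redefined: redefined)
  if FOLDABLE.include?(op) && !redefined.include?(op) &&
     lhs[0] == :lit && rhs[0] == :lit &&
     lhs[1].is_a?(Integer) && rhs[1].is_a?(Integer)
    [:lit, lhs[1].public_send(op, rhs[1])]  # compute at compile time
  else
    [:call, op, lhs, rhs]                   # must stay a real method call
  end
end

fold([:call, :+, [:lit, 1], [:lit, 2]])                            # => [:lit, 3]
fold([:call, :+, [:lit, 1], [:lit, 2]], redefined: [:+])           # left as a call
fold([:call, :*, [:lit, 2], [:call, :+, [:lit, 1], [:lit, 2]]])    # => [:lit, 6]
```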
      • Code elimination
      # before
      loop {
        begin
          # eval'ed code
          break
          break   # unreachable after the first break
        ensure
        end
      }

      # after dead code elimination
      loop {
        begin
          # eval'ed code
          break
        ensure
        end
      }
      • Subexpression elimination
      x = x - (y * 2)
      z = z - (y * 2)

      # after: y * 2 is computed once
      t = y * 2
      x = x - t
      z = z - t
      • Constant propagation
      def a
        b = 20
        c(3 * b)
      end

      # after constant propagation
      def a
        b = 20
        c(3 * 20)
      end

      # after folding (the assignment to b is now dead)
      def a
        c(60)
      end
      • In-lining
      def b
        2 * 3
      end

      def a
        2 + b
      end

      # after in-lining b into a
      def a
        2 + (2 * 3)
      end
      • Cloning
      def a(b, c)
        b << c
        expire_cache
      end

      a('railsway', 'con')

      # clone specialized for the call site above
      def a_railsway_con
        'railsway' << 'con'
        expire_cache
      end
      • Peephole Optimization (before)
      x = true    # 0008 getlocal      x
      if x        # 0010 branchunless  17
      else        # 0012 jump          14
      end         # 0014 putnil
                  # 0015 jump          18
                  # 0017 putnil
                  # 0018 leave
      • Peephole Optimization (after)
      x = true    # 0008 getlocal      x
      if x        # 0010 branchunless  15
      else        # 0012 putnil
      end         # 0013 leave
                  # 0014 pop
                  # 0015 putnil
                  # 0016 leave
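A peephole pass like the before/after listings above can be sketched over an instruction list that uses symbolic labels instead of byte offsets, which avoids renumbering. The `peephole` helper and its two rules are hypothetical illustrations, not YARV's actual optimizer: a `jump` to a label whose next real instruction is `leave` becomes `leave`, and a `jump` to the label immediately following it is dropped. Rules are reapplied until nothing changes.

```ruby
def peephole(insns)
  loop do
    changed = false
    insns = insns.each_with_index.flat_map do |(op, arg), i|
      next [[op, arg]] unless op == :jump
      target = insns.index([:label, arg])
      after_label = target && insns[(target + 1)..-1].find { |o, _| o != :label }
      if after_label && after_label[0] == :leave
        changed = true
        [[:leave, nil]]          # rule 1: jump to leave -> leave
      elsif insns[i + 1] == [:label, arg]
        changed = true
        []                       # rule 2: jump to the next instruction -> removed
      else
        [[op, arg]]
      end
    end
    return insns unless changed
  end
end

before = [
  [:getlocal, :x],
  [:branchunless, :l17],
  [:jump, :l14],
  [:label, :l14],
  [:putnil, nil],
  [:jump, :l18],
  [:label, :l17],
  [:putnil, nil],
  [:label, :l18],
  [:leave, nil]
]
after = peephole(before)   # no jumps left; the if branch now ends in its own leave
```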
      • OBJECTS
      • Object Requirements
      • Stateful
      • Identity
        • Unique identifier to represent the object at runtime
      • Methods
        • Change or query object state
        • Command and Query pattern
      • Object structure
        typedef unsigned long VALUE;

        struct RBasic {
          VALUE flags; /* object flags */
          VALUE klass; /* instance of ... */
        };
      • Object structure (cont)
      • Casting
        • Pointer types that represent addresses of language structures
        • RBASIC(obj)->flags
        • ((struct RBasic *)obj)->flags
      Flags
        • frozen
        • marked
        • tainted
        • embedded status
      • Classes / modules structure
        struct RClass {
          struct RBasic basic;           /* object structure */
          rb_classext_t *ptr;            /* external class struct */
          struct st_table *m_tbl;        /* method table */
          struct st_table *iv_index_tbl; /* ivars */
        };
      • Class / module structure (cont)
      • Casting
        • RCLASS(rb_cString)->ptr->super #=> Object
        • RCLASS(rb_cFixnum)->ptr->super #=> Integer
      Attributes
        • Symbol tables for methods and ivars
        • Class / module distinction through flags
      • Special objects
      • Immediates
        • No runtime casting overheads – fits in VALUE
        • nil #=> 4
        • true #=> 2
        • false #=> 0
        • Symbols
        • Fixnums <= 30 bits
        • Floats and Bignums are heap-allocated objects, not immediates – hence poor floating point benchmarks
        • RFLOAT(float_obj)->float_value #=> a double
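The immediate encoding above can be modelled with plain Integers standing in for machine words. This is a sketch of the 32 bit Ruby 1.9 scheme as the slide describes it (false = 0, true = 2, nil = 4, Fixnums tagged with the low bit); the helper names imitate MRI macros but are hypothetical Ruby methods, and Symbols and heap pointers are left out.

```ruby
QFALSE = 0
QTRUE  = 2
QNIL   = 4

# Fixnum encoding: shift left one bit and set the low "Fixnum" tag bit.
def int2fix(n)
  (n << 1) | 1
end

def fixnum_p(value)
  (value & 1) == 1
end

def fix2int(value)
  raise ArgumentError, "not a Fixnum" unless fixnum_p(value)
  value >> 1               # arithmetic shift keeps the sign
end

# Truthiness: only false and nil are falsy.
def rtest(value)
  value != QFALSE && value != QNIL
end

int2fix(21)             # => 43
fix2int(int2fix(-3))    # => -3
rtest(int2fix(0))       # => true (0 is truthy in Ruby)
```

Because the tag lives in the word itself, no object has to be allocated or dereferenced to read a Fixnum, true, false or nil.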
      • Object memory layout
      • Object#object_id (32 bit architecture)
        • sizeof(VALUE) is 4 bytes
        • Objects: even, multiples of 4
        • Symbols: even, multiples of 8
        • Integers: odd
        • Immediates: <= 4
      • Mutable Objects
        struct RString {
          struct RBasic basic;
          union {
            struct {
              long len;
              char *ptr;
              union {
                long capa;
                VALUE shared;
              } aux;
            } heap;
            char ary[RSTRING_EMBED_LEN_MAX + 1]; /* short strings are embedded */
          } as;
        };
      • Mutable Objects (cont)
      • String and Array
        • require the ability to shrink / grow capacity
        • allocates slightly more data than required
        • Avoids repeated malloc, realloc and memmove overhead as they grow
        • Short strings (“str”) are embedded in the object itself
        • Short arrays (%w(a r y)) are embedded in the object itself
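The benefit of over-allocating capacity can be shown with a hypothetical growable buffer. `GrowableBuffer` and its doubling policy are assumptions, not MRI's exact growth rule: an append only "reallocates" (allocate plus copy) when `len` would exceed `capa`, so n appends trigger only O(log n) copies.

```ruby
class GrowableBuffer
  attr_reader :len, :capa, :reallocs

  def initialize(capa = 8)
    @capa = capa
    @len = 0
    @bytes = "\0".b * capa      # reserved storage, beyond what is in use
    @reallocs = 0
  end

  def append(str)
    needed = @len + str.bytesize
    if needed > @capa
      @capa *= 2 while @capa < needed       # assumed doubling policy
      grown = "\0".b * @capa
      grown[0, @len] = @bytes[0, @len]      # the copy a realloc/memmove would do
      @bytes = grown
      @reallocs += 1
    end
    @bytes[@len, str.bytesize] = str.b      # write into spare capacity
    @len = needed
    self
  end

  def to_s
    @bytes[0, @len]
  end
end

buf = GrowableBuffer.new
1000.times { buf.append("ab") }
buf.len       # => 2000
buf.reallocs  # => 8 (capacity 8 -> 16 -> ... -> 2048)
```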
      • Shared Objects
        str  = 'railsway'
        str2 = "#{str}con"          # shared ref
        str3 = str << 'con'         # copy + mod
        ary  = %w(railsway con)
        ary2 = ary.dup              # shared ref
        ary3 = ary2.delete_at(1)    # copy + mod
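The "shared ref" / "copy + mod" annotations describe copy-on-write, which can be sketched with a hypothetical `CowString` class (not MRI internals): `dup` shares the underlying buffer and bumps a reference count, and the first mutation of a shared instance copies the buffer before modifying it.

```ruby
class CowString
  Buffer = Struct.new(:data, :refs)

  def initialize(data, buffer = nil)
    @buffer = buffer || Buffer.new(data.dup, 1)
  end

  # Shared reference: no bytes are copied.
  def dup
    @buffer.refs += 1
    CowString.new(nil, @buffer)
  end

  def shared_with?(other)
    @buffer.equal?(other.buffer)
  end

  def <<(str)
    if @buffer.refs > 1                         # shared: copy first ...
      @buffer.refs -= 1
      @buffer = Buffer.new(@buffer.data.dup, 1)
    end
    @buffer.data << str                         # ... then modify the private copy
    self
  end

  def to_s
    @buffer.data.dup
  end

  protected

  attr_reader :buffer
end

a = CowString.new("railsway")
b = a.dup            # shares a's buffer
b << "con"           # copy + mod: a is unaffected
```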
      • Method Dispatch
      • Language constraints
        • Dynamic typing
        • Open classes
        • Method calls can never be statically reduced to CALL(a_method)
        • Search overhead
      • Language constraints
      • Dispatch sequence
      • Deref class pointer
      • Check methods table
      • Call method or delegate to superclass
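The three-step dispatch sequence can be modelled in Ruby. `KlassModel`, `ObjModel`, `lookup` and `dispatch` are hypothetical: each class has a method table (a Hash from name to callable) and a superclass pointer, and lookup walks the chain until the name is found. For brevity the sketch raises `NoMethodError` on a miss, whereas real Ruby would try `method_missing` first.

```ruby
class KlassModel
  attr_reader :name, :superklass, :m_tbl

  def initialize(name, superklass = nil)
    @name = name
    @superklass = superklass
    @m_tbl = {}                          # method table: name -> callable
  end

  def define(meth, &body)
    @m_tbl[meth] = body
  end
end

ObjModel = Struct.new(:klass)

# Returns [body, number_of_classes_searched]
def lookup(klass, meth)
  searched = 0
  while klass                            # deref class pointer
    searched += 1
    body = klass.m_tbl[meth]             # check method table
    return [body, searched] if body
    klass = klass.superklass             # delegate to superclass
  end
  [nil, searched]
end

def dispatch(obj, meth, *args)
  body, = lookup(obj.klass, meth)
  raise NoMethodError, "undefined method #{meth} for #{obj.klass.name}" unless body
  body.call(obj, *args)
end

object_k = KlassModel.new("Object")
object_k.define(:to_s) { |recv| "#<#{recv.klass.name}>" }
user_k = KlassModel.new("User", object_k)
user_k.define(:greet) { |_recv, who| "hi #{who}" }
u = ObjModel.new(user_k)
dispatch(u, :greet, "railsway")   # => "hi railsway"
dispatch(u, :to_s)                # => "#<User>", found one class up
```

Every level of inheritance adds another table probe, which is the search overhead the method cache exists to hide.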
      • call VS send
      • obj.__send__ :method
        • We never call methods
        • Send query / command messages to objects
        • Methods return values – RPC style messaging
      • Method cache
        • Method cache == router
        • 95% hit rate when warm
        • Method redefinition, module inclusion etc. clears the method cache / “routing table”
        • Subsequent method calls pay the full lookup cost until the cache warms up again
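The "router" behaviour can be sketched as a hypothetical global cache keyed by [class, method name] (`GlobalMethodCache` is illustrative, not MRI's structure). Per the slides, a method definition or module inclusion clears the whole thing, so the next calls miss and pay for the full lookup again.

```ruby
class GlobalMethodCache
  attr_reader :hits, :misses

  def initialize
    @entries = {}
    @hits = @misses = 0
  end

  # The block performs the slow lookup (e.g. walking the superclass chain).
  def fetch(klass, meth)
    key = [klass, meth]
    if @entries.key?(key)
      @hits += 1
      @entries[key]
    else
      @misses += 1
      @entries[key] = yield
    end
  end

  # Wholesale invalidation, as on method (re)definition or module inclusion.
  def clear!
    @entries.clear
  end

  def hit_rate
    total = @hits + @misses
    total.zero? ? 0.0 : @hits.fdiv(total)
  end
end

cache = GlobalMethodCache.new
20.times { cache.fetch(String, :upcase) { String.instance_method(:upcase) } }
warm_rate = cache.hit_rate   # => 0.95 (1 miss, then 19 hits)
cache.clear!                 # e.g. what @user.extend SomeBehavior triggers
cache.fetch(String, :upcase) { String.instance_method(:upcase) }   # misses again
```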
      • Method cache don'ts
        class SomeController < AC::Base
          def show
            # busts method cache for the whole VM
            @user.extend SomeBehavior
          end
        end
      • Instance var changes
      • Optimizations
        • First 3 ivars are embedded in the object
        • Avoids symbol table lookups
      • ivar table
        • Table is per class, not per object
        • Ivar table is shared by all instances of the same class
        • Saves on memory footprint of a table per instance
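The per-class ivar table can be modelled as follows. `IvClass`, `IvObject` and `EMBED_LEN_MAX` are hypothetical names: the class owns one name-to-index table shared by all instances, while each instance only stores values, the first 3 "embedded" in the object and any further ones in a separately allocated array.

```ruby
EMBED_LEN_MAX = 3   # the slide's "first 3 ivars"

class IvClass
  attr_reader :iv_index_tbl

  def initialize
    @iv_index_tbl = {}                      # per class, shared by all instances
  end

  # Assign the next free index the first time a name is seen.
  def index_for(name)
    @iv_index_tbl[name] ||= @iv_index_tbl.size
  end
end

class IvObject
  def initialize(klass)
    @klass = klass
    @embedded = Array.new(EMBED_LEN_MAX)    # values stored in the object itself
    @extended = nil                         # separate array, only if needed
  end

  def ivar_set(name, value)
    idx = @klass.index_for(name)
    if idx < EMBED_LEN_MAX
      @embedded[idx] = value
    else
      (@extended ||= [])[idx - EMBED_LEN_MAX] = value
    end
    value
  end

  def ivar_get(name)
    idx = @klass.iv_index_tbl[name]
    return nil if idx.nil?
    return @embedded[idx] if idx < EMBED_LEN_MAX
    @extended && @extended[idx - EMBED_LEN_MAX]
  end

  def embedded?
    @extended.nil?
  end
end
```

A side effect of sharing the index table: once any instance sets a 4th ivar, the index exists for every instance, even those that never use it.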
      • GARBAGE COLLECTION
      • Process memory layout
      • Code segment
        • Executable code
        • Read only
      • Stack segment
        • Stack storage
        • Addressed with stack pointers
      • Heap
        • Memory available for program / developer use
      • Malloc
      • Usable / free space
        • Managed by a free list
        • Linear search overhead to find free chunks
      • Better layout
        • Index free chunks by size intervals
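The two free-space layouts can be contrasted with a hypothetical sketch (class names and size classes are assumptions, not glibc's malloc). A single list needs a linear first-fit scan. Indexing chunks by power-of-two size class lets allocation go straight to a class whose chunks are all large enough, at the cost of sometimes skipping a chunk that would have fit.

```ruby
# One free list: linear first-fit search.
class LinearFreeList
  attr_reader :scanned

  def initialize(chunk_sizes)
    @free = chunk_sizes.dup
    @scanned = 0
  end

  def alloc(size)
    @free.each_with_index do |chunk, i|
      @scanned += 1
      return @free.delete_at(i) if chunk >= size   # first fit
    end
    nil
  end
end

# Free chunks indexed by size interval [2**k, 2**(k+1)).
class SizeClassFreeLists
  def initialize(chunk_sizes)
    @bins = Hash.new { |h, k| h[k] = [] }
    chunk_sizes.each { |c| @bins[c.bit_length - 1] << c }   # floor(log2 c)
  end

  def alloc(size)
    min_class = (size - 1).bit_length     # every chunk in this class or above fits
    k = @bins.keys.select { |b| b >= min_class && !@bins[b].empty? }.min
    k && @bins[k].pop
  end
end

chunks = [16, 24, 200, 64, 40, 1024]
linear = LinearFreeList.new(chunks)
linear.alloc(100)      # => 200, after scanning 3 chunks
indexed = SizeClassFreeLists.new(chunks)
indexed.alloc(100)     # => 200, taken directly from the 128..255 class
```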
      • GC terminology
      • Root set
        • Directly accessible without pointer scanning
        • C stack, global vars, global constants etc.
      • Ways objects become unreachable
        • Assigning nil to the variable that referenced them
        • Method return etc.
      • Conservative: the VM hands out raw pointers to objects
      • GC strategies
      • Stop the World
        • Minimal allocation overhead
        • Hands out objects while heap space is available
        • Halts execution to reclaim memory
        • Very disruptive in the hot path
      • Incremental
        • Collection activity during allocation
        • Smoother, but with some minor overhead
        • Better suited to realtime environments
      • Scripting GC
      • Mark and Sweep
        • Identifies live objects
        • Assumes remainder is for collection
        • Concerned with unreachable objects
      • Stop and Copy
        • 2 heap spaces (double memory overhead)
        • 1 active, 1 inactive
        • Copies reachable chunks to the new active area
        • Concerned with live objects
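Stop and Copy can be sketched as a semispace collector. `SemispaceHeap` is hypothetical and uses a Cheney-style breadth-first scan: reachable objects are copied from the active space into a fresh one starting at the roots, references are redirected to the copies, and the spaces flip. Only live objects are touched; garbage, including unreachable cycles, is dropped implicitly.

```ruby
class SemispaceHeap
  Obj = Struct.new(:name, :refs)

  attr_reader :from_space

  def initialize
    @from_space = []                          # the active space
  end

  def alloc(name, refs = [])
    obj = Obj.new(name, refs)
    @from_space << obj
    obj
  end

  # Copies everything reachable from roots into a fresh to-space and returns
  # the copied roots (copying changes object identity).
  def collect(roots)
    to_space = []
    forwarding = {}.compare_by_identity       # old object -> its copy
    copy = lambda do |obj|
      forwarding[obj] ||= begin
        moved = Obj.new(obj.name, obj.refs)   # refs still point at old objects
        to_space << moved
        moved
      end
    end
    new_roots = roots.map(&copy)
    scan = 0
    while scan < to_space.size                # scan pointer chases the copies
      current = to_space[scan]
      current.refs = current.refs.map(&copy)  # redirect refs to the copies
      scan += 1
    end
    @from_space = to_space                    # flip the spaces
    new_roots
  end
end

heap = SemispaceHeap.new
c = heap.alloc(:c)
b = heap.alloc(:b, [c])
a = heap.alloc(:a, [b])
heap.alloc(:garbage)
x = heap.alloc(:x)
x.refs = [heap.alloc(:y, [x])]               # an unreachable cycle
roots = heap.collect([a])
heap.from_space.map(&:name)                   # => [:a, :b, :c]
```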
      • Common GC Issues
      • Conservative GC
        • Memory fragmentation
        • Dangling pointers
        • Memory leaks from circular garbage
      • Allocation
        • Bursty allocation
        • Knowledge of pointer layout and chunks required
      • Ruby heap layout
      • Multiple heaps
        • Referenced through heap list
        • Composed of multiple slots
        • Freed when empty, i.e. only if all of its slots are tagged as free
        • A Rails app allocates 4 to 6 heaps on startup
      • Slot layouts
      • Per heap
        • Each slot references a single object
        • Defaults to 10 000 slots for the first heap
        • Threshold of 4096 free slots per heap
        • Free list points to the next free slot
      • Heap growth
        • Next allocated heap has 1.8x the capacity of the last one
        • That's why memory consumption's so high ...
      • Heap growth – small app
        >> 8 * 1.8
        => 14.4
        >> 8 * 1.8 * 1.8
        => 25.92
        >> 8 * 1.8 * 1.8 * 1.8
        => 46.656
        >> 8 * 1.8 * 1.8 * 1.8 * 1.8
        => 83.9808
      • Heap growth – mid to large app
        => 83.9808
        >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
        => 151.16544
        >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
        => 272.097792
        >> 8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8 * 1.8
        => 489.7760256
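The same growth can be expressed in slots rather than the irb multiples above. This rough model assumes the rule as the slides state it: a first heap of 10 000 slots, each new heap 1.8x the previous one, and 40 bytes per slot on a 64 bit build. The constant and method names are hypothetical, and the real allocator rounds differently.

```ruby
FIRST_HEAP_SLOTS = 10_000
SLOT_BYTES = 40   # per the slides: 40 bytes on 64 bit, 20 on 32 bit

# Slot counts of the first heap_count heaps, each 1.8x the previous
# (integer math, truncating like a size_t cast would).
def heap_sizes(heap_count)
  sizes = [FIRST_HEAP_SLOTS]
  sizes << sizes.last * 18 / 10 while sizes.size < heap_count
  sizes.first(heap_count)
end

def total_slots(heap_count)
  heap_sizes(heap_count).sum
end

heap_sizes(5)                  # => [10000, 18000, 32400, 58320, 104976]
total_slots(5)                 # => 223696
total_slots(5) * SLOT_BYTES    # => 8947840 bytes just for the slots
```

The last heap alone holds almost half of all slots, which is why a single extra heap allocation late in a process's life is so visible in its memory footprint.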
      • Slot structure
        typedef struct RVALUE {
          union {
            struct {
              VALUE flags; /* 0 when free */
              struct RVALUE *next;
            } free;
            struct RObject object;
            struct RFloat  flonum;
            ...
      • Pointer layout
      • Self describing
        • Program data area and heap
        • RVALUE union can accommodate any ruby object
        • Frames, variable structures etc. well defined also
        • 40 bytes (64 bit arch) represents a slot
        • Free list points to the next free slot
      • Ruby heap VS OS heap
      • Ruby heap
        • 20 bytes (32 bit arch) represents a slot
        • A slot can point to data on the OS / system heap
      • OS heap
        • Thus a 20 byte slot can reference a 2MB chunk on the system heap
      • CRuby: Mark and Sweep
      • Conservative
        • Cannot determine with certainty if a value references an object – assume it's in use
      • Two phase implementation
        • Mark phase: identifies and flags reachable objects from the current program context
        • Sweep phase: iterates through the object space, freeing all unmarked objects and unmarking the marked ones
      • Concerns
      • Performance
        • Runtime pauses
        • Work proportional to heap size
        • Prone to memory fragmentation (no compaction)
        • Recursive
      • Triggers
        • Every 8MB allocated through malloc (GC_MALLOC_LIMIT) triggers GC
        • Not enough heap reserve
      • GC in action
        # each line allocates 4 objs: 1 Array, 3 Strings
        ary1 = %w(a b c)
        ary2 = %w(d e f)
        # both ary1 and ary2 are reachable
        ary1 = nil
        # ary1 and its contents are unreachable
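The two-phase algorithm and the ary1 / ary2 example above can be sketched over a toy object space. `MarkSweepHeap` is hypothetical, and the root list stands in for the C stack, globals and constants. One deliberate difference from the "Recursive" concern on the slides: this sketch marks with an explicit stack instead of recursion.

```ruby
class MarkSweepHeap
  Obj = Struct.new(:name, :refs, :marked)

  attr_reader :objects

  def initialize
    @objects = []
  end

  def alloc(name, refs = [])
    obj = Obj.new(name, refs, false)
    @objects << obj
    obj
  end

  def collect(roots)
    mark(roots)
    sweep
  end

  private

  # Mark phase: flag everything reachable from the roots.
  def mark(roots)
    stack = roots.dup
    until stack.empty?
      obj = stack.pop
      next if obj.marked
      obj.marked = true
      stack.concat(obj.refs)
    end
  end

  # Sweep phase: free unmarked objects, unmark survivors for the next cycle.
  def sweep
    freed = @objects.reject(&:marked)
    @objects = @objects.select(&:marked)
    @objects.each { |obj| obj.marked = false }
    freed.map(&:name)
  end
end

heap = MarkSweepHeap.new
a, b, c = heap.alloc("a"), heap.alloc("b"), heap.alloc("c")
ary1 = heap.alloc("ary1", [a, b, c])
d, e, f = heap.alloc("d"), heap.alloc("e"), heap.alloc("f")
ary2 = heap.alloc("ary2", [d, e, f])
# ary1 = nil: only ary2 is still referenced from the root set
freed = heap.collect([ary2])   # => ["a", "b", "c", "ary1"]
```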
      • Generational GC
      • Observations
        • Vast majority of objects are short lived – 80%+
        • Expensive to account for long lived objects
        • Partition by age and frequently collect short lived ones
      • How it works
        • Restrict GC to the most recently allocated slots
        • These “sub heaps” are referred to as generations
        • Perform a full GC only when collecting the youngest generation fails to meet memory requirements
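A two-generation collector can be sketched in Ruby. `GenerationalHeap` is hypothetical: new objects start young, a minor collection traces and sweeps only the young generation and promotes survivors, and a full collection covers both. Real generational collectors track old-to-young references with a write barrier and remembered set; here, as a crude stand-in, every reference held by an old object counts as an extra root during a minor GC.

```ruby
class GenerationalHeap
  Obj = Struct.new(:name, :refs)

  attr_reader :young, :old

  def initialize
    @young = []
    @old = []
  end

  def alloc(name, refs = [])
    obj = Obj.new(name, refs)
    @young << obj                               # new objects start young
    obj
  end

  # Minor GC: only the young generation is traced and swept.
  def minor_gc(roots)
    live = reachable(roots + @old.flat_map(&:refs), @young)
    freed, survivors = @young.partition { |o| !live[o] }
    @old.concat(survivors)                      # promote survivors
    @young = []
    freed.map(&:name)
  end

  # Full GC: both generations are traced and swept.
  def full_gc(roots)
    all = @old + @young
    live = reachable(roots, all)
    freed, survivors = all.partition { |o| !live[o] }
    @old = survivors
    @young = []
    freed.map(&:name)
  end

  private

  # Objects outside `space` are neither marked nor traced through.
  def reachable(roots, space)
    in_space = {}.compare_by_identity
    space.each { |o| in_space[o] = true }
    live = {}.compare_by_identity
    stack = roots.dup
    until stack.empty?
      obj = stack.pop
      next if live[obj] || !in_space[obj]
      live[obj] = true
      stack.concat(obj.refs)
    end
    live
  end
end

heap = GenerationalHeap.new
session = heap.alloc("session")
heap.alloc("tmp1")
heap.alloc("tmp2")
heap.minor_gc([session])   # => ["tmp1", "tmp2"]; session is promoted
```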
      • CONCURRENCY
      • Threading
      • Changes
        • Native OS Threads
        • Ruby Thread == pthread
        • Multiple cores ftw!
      • … but
        • Creating, scheduling and synchronizing threads requires syscalls
        • Much more expensive to spawn and switch than green threads
        • Global VM Lock (GVL)
      • Global VM Lock (GVL)
      • How it works
        • Thread that owns the GVL is allowed to execute
        • Blocking operations should release the GVL
        • Automatically released when scheduled
        • C extension authors need not be concerned with synchronization
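The GVL protocol can be modelled with a single Mutex. `ToyGVL` is hypothetical (MRI already runs Ruby code under its real GVL): a thread must hold the lock to run "VM code" and releases it around a blocking operation so another thread can run in the meantime, then re-acquires it before touching VM state again.

```ruby
class ToyGVL
  def initialize
    @lock = Mutex.new
  end

  # Run Ruby-level work while owning the lock.
  def with_vm(&block)
    @lock.synchronize(&block)
  end

  # Release the lock around a blocking call, then re-acquire it.
  def blocking_region
    @lock.unlock
    begin
      yield
    ensure
      @lock.lock
    end
  end
end

gvl = ToyGVL.new
log = Queue.new
threads = 3.times.map do |i|
  Thread.new do
    gvl.with_vm do
      log << "t#{i} start"
      gvl.blocking_region { sleep 0.05 }   # e.g. a slow read; others may run now
      log << "t#{i} done"
    end
  end
end
threads.each(&:join)
```

Without the `blocking_region` call the three sleeps would run back to back, because each thread would hold the lock while blocked.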
      • Blocking VM operations
      • I/O
        • blocking reads and writes
        • DNS resolution or connects
        • Often has huge handshake overheads
      • Computations, processes and locks
        • Expensive Bignum ops blocked 1.8 interpreters
        • Process.waitpid
        • File locks
      • Releasing the GVL
      • Stable API (rb_thread_blocking_region)
        • Blocking function: slow system call / computation
        • Unblock function: called on Thread interrupt
      • Pitfalls
      • Cannot access VALUEs (objects) in blocking functions
      • No integration with Ruby's exception / error handler
      • Lightweight Concurrency
      • Fibers
        • Coroutines – 4k stack size
        • Very fast user space context switches
        • Cooperative scheduling required
        • Fiber.yield pauses the activation record, which keeps context across multiple calls
      • Use cases
        • Generators
        • Blocking I/O - Neverblock
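The generator use case can be shown with Ruby's own Fiber API: `Fiber.yield` suspends the fiber while keeping its local state, and each `resume` continues from where it left off. The Fibonacci sequence is just an illustrative choice.

```ruby
fib = Fiber.new do
  a, b = 0, 1
  loop do
    Fiber.yield a          # hand a value back to the caller and pause here
    a, b = b, a + b        # resumes here on the next fib.resume
  end
end

first_ten = Array.new(10) { fib.resume }
# => [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```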
      • In the pipeline
      • MVM: Multiple Virtual Machines
        • Shared process state
        • Sandboxed per VM application state
        • Distribute VMs across available cores
        • Message passing for inter VM communication
        • Most Ruby deployments aren't thread safe
        • MVM is well suited for this
      • QUESTIONS ?