Lazy Data Using Perl

Virtuous Laziness
For Your Data

Doing it once and
knowing you've done it.

Steve Lembark
Workhorse Computing
lembark@wrkhors.com

Data wants to be free,
it also wants to be very expensive.
● Data you never use is free; accessing it gets expensive.
● Managing data's cost requires controlling its lifecycle.
● Mismanaging the lifecycle causes problems:
● Caching table lookups in forked httpd crashes your database.
● Preprocessing message files kills your startup times.
● Reading XML configuration data you never use wastes most of
your testing budget.
● Your tests fail because of a connection failure to a database
server that you don't use in the test.
● Fixing these problems requires using lazy data.

False Laziness
● We have all seen the two most common data
management strategies:
● You load everything at startup to avoid checking it every
time it is used.
● You check everything every time before using it to avoid
loading it up front.
● Both approaches ignore important knowledge:
● You know when the data is needed.
● You know that the data was loaded.

One alternative: Scalar Cache
● $cache ||= read_cache_values;
● Seems nice: You know that $cache is populated.
● This requires dereferencing a hashref throughout the
program, which is expensive.
● What you'd rather do is just use $cache{ $foobar }
without having to check it every time.
● Even checking if( ! %cache ) is expensive – arrays
are much cheaper to check.
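
The scalar-cache approach can be sketched roughly as below. This is only an illustration: `read_cache_values` is a hypothetical stand-in for an expensive loader, and `lookup`, `$loads`, and the keys are made up for the example. The load happens once, but the `||=` check and the hashref dereference are paid on every call.

```perl
use strict;
use warnings;

my $loads = 0;

# Hypothetical loader standing in for an expensive read.
sub read_cache_values
{
    ++$loads;

    return +{ foo => 1, bar => 2 };
}

my $cache;

sub lookup
{
    my $key = shift;

    # This check runs on every call, even after the cache is loaded.
    $cache ||= read_cache_values();

    # Every access goes through the reference.
    return $cache->{ $key };
}

my @found = map { lookup( $_ ) } qw( foo bar foo );
```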

True Laziness:
Do Something Once
● Truly lazy data means loading your data when you
need it and knowing that it is loaded.
● Perl gives us – of course – more than one way:
● Trampoline objects and subroutines.
● “state” variables, introduced in v5.10.
● Trampolines are flyweight objects – data structures or
subroutines that transform themselves when used.
● State variables are assigned only once, at runtime, the
first time they are used.

Follow The Bouncing Object
● Object::Trampoline – flyweight data you don't know
isn't there.
● These delay calling the “real” constructor until the
object is actually used.
● Your constructor gets called before the first method call.
● At that point it can cache, parse, or compute whatever it
needs.
● Spreads the cost of loading each data set over the
lifetime of a process.
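
The idea can be sketched with core Perl alone. This is a hand-rolled illustration of the technique, not Object::Trampoline's own code: `Lazy::Bounce` and `Expensive::Thing` are hypothetical names. The placeholder is a blessed closure; the first method call lands in AUTOLOAD, which swaps the caller's variable for the real object and redispatches.

```perl
use strict;
use warnings;

package Lazy::Bounce;

our $AUTOLOAD;

# Build a placeholder: a blessed closure that remembers how to
# construct the real object later.
sub new
{
    my ( $proto, $class, @argz ) = @_;

    bless sub { $class->new( @argz ) }, $proto
}

# Any method call on the placeholder lands here.
sub AUTOLOAD
{
    my $name = ( split /::/, $AUTOLOAD )[-1];

    # $_[0] aliases the caller's variable, so assigning to it
    # swaps the placeholder for the real object in place.
    $_[0] = $_[0]->();

    my $method = $_[0]->can( $name )
    or die "Unknown method: '$name'";

    goto &$method
}

# Defined explicitly so that destruction never reaches AUTOLOAD.
sub DESTROY {}

package Expensive::Thing;

our $built = 0;

sub new
{
    my ( $class, %data ) = @_;

    ++$built;

    bless { %data }, $class
}

sub value { $_[0]->{ $_[1] } }

package main;

my $thing  = Lazy::Bounce->new( 'Expensive::Thing', answer => 42 );
my $before = $Expensive::Thing::built;      # nothing constructed yet

my $answer = $thing->value( 'answer' );     # first call constructs
my $again  = $thing->value( 'answer' );     # ordinary method call
```

After the first call `$thing` holds the real object, so later calls never pass through the placeholder class again.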

Example: Delay Expensive XML
● Use an initializer to read and parse the XML.
● The Object::Trampoline calls your constructor once
to transform the object the first time you call a
method.
● Parsing the XML is pushed off until you actually use
the data.
● Requires using a hashref – and possibly methods –
to access the data.

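package XML::Message;
use Object::Trampoline;
...

sub new
{
    my ( undef, $path ) = @_;

    # returns a trampoline; install() runs on the first method call
    Object::Trampoline->install( 'XML::Message', $path );
}

sub install
{
    # construct (elided above) is assumed to shift the class
    # off of @_ and return the new, empty object.
    my $error = &construct;

    $error->initialize( @_ )
}

sub initialize
{
    my ( $err, $path ) = @_;

    # the expensive part: read and parse the XML
    %$err = %{ XMLin $path => @lots_of_args };

    $err
}

# calling translate bounces the trampoline exactly once

sub translate { … }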

Trampoline Subroutine
● Similar to a trampoline object: A portion of the code
runs once and replaces itself.
● Oneshot code initializes the cache, which can now
be a simple hash.
● Symbol::qualify_to_ref makes this painless in Perl.
● An anonymous sub manages the cache.
● A named subroutine loads the cache, replaces itself with
the manager, and redispatches to the manager.

The Simplest Version
● Minimal code includes:
● Closed-over cache,
● Subref to permanent cache manager,
● Initial subroutine.
● The initial subroutine initializes, installs, and
redispatches.

use Symbol qw( qualify_to_ref );

# closed-over cache
my %foo_cache = ();

# permanent cache manager
my $handler
= sub
{
    my $foo = shift;

    $foo_cache{ $foo }
    or die "Bogus foo: '$foo' unknown"
};

# initial subroutine: runs once, then replaces itself
sub do_something
{
    %foo_cache = initialize_foo_cache;

    my $ref = qualify_to_ref 'do_something';

    no warnings 'redefine';

    *$ref = $handler;

    goto &$handler
}

BEGIN Blocks Are Cleaner
● The block isolates cache, ref, and sub variables;
allows recycling the ref.
● This is also rather amenable to installation by
module.

BEGIN
{
    my $name    = 'foo_handler';
    my $ref     = qualify_to_ref $name;
    my %cache   = ();

    my $handler
    = sub
    {
        $cache{ $_[0] } or die ...
    };

    *$ref
    = sub
    {
        %cache = init_the_cache;

        *$ref = $handler;

        goto &$handler
    }
}

Sub::Trampoline
● Aside from the actual assignment, init code is
identical.
● Simply pass the name, init assignment, and manager.
● The module can call $init, replace itself.
● Caller defines the cache and handler.

sub install_trampoline
{
    my ( $name, $init, $mgr ) = @_;

    my $caller = caller;

    my $ref = qualify_to_ref $name, $caller;

    *$ref
    = sub
    {
        $init->();

        # assign to the glob, not to a call of the sub
        *$ref = $mgr;

        goto &$mgr
    }
}

Using Sub::Trampoline
use Carp;
use List::Util qw( first );
use Sub::Trampoline;

my %cache1 = ();
my @cache2 = ();

my $cache1_mgr
= sub { $cache1{ $_[0] } or croak "Unknown '$_[0]'" };

my $cache2_mgr
= sub { my $want = shift; first { $_ eq $want } @cache2 };

my $init_cache1 = sub { %cache1 = select_from_hell };
sub init_cache2 { @cache2 = XMLin $nasty_messy_xml_struct }

install_trampoline( subone => $init_cache1, $cache1_mgr );
install_trampoline( known => \&init_cache2, $cache2_mgr );

# the subs are installed at runtime, so call them with parens
my $value1 = subone( $key1 );

if( known( $value ) )
{ … }
else
{ carp "Unknown: '$value'" }
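
A self-contained variant that can be run as-is is sketched below. It assumes the installer is defined in the same file rather than loaded from a module, and `color_of`, `%color`, and `$loads` are hypothetical names used only to show that the initializer runs once.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

sub install_trampoline
{
    my ( $name, $init, $mgr ) = @_;

    my $ref = qualify_to_ref( $name, scalar caller );

    # Replacing an installed sub would otherwise warn.
    no warnings 'redefine';

    *$ref
    = sub
    {
        $init->();          # populate the cache once
        *$ref = $mgr;       # later calls go straight to the manager
        goto &$mgr          # redispatch this call with the same @_
    };

    return
}

my %color = ();
my $loads = 0;

install_trampoline
(
    color_of =>
    sub { ++$loads; %color = ( sky => 'blue', grass => 'green' ) },
    sub { $color{ $_[0] } or die "Unknown: '$_[0]'" },
);

my $sky   = color_of( 'sky' );      # first call loads the cache
my $grass = color_of( 'grass' );    # handled by the manager alone
```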

True OneShot: Empty $handler
● If you want to run something exactly once but don't
know where it might be called initially:
my $manager = sub(){};
● You can also substitute a trampoline object with a
constructor that does the work and no methods.
● Calling the object once constructs it, after which the
class's constructor can stub itself.
● Useful for sharing the cache variable: the init
populates it once and stubs itself to do nothing more.
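
A minimal sketch of the one-shot pattern, assuming a shared hash and an initializer that several subs call defensively. The names `load_settings`, `is_verbose`, `setting_names`, and `%settings` are hypothetical.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

my %settings = ();
my $runs     = 0;

# Runs its body once, then replaces itself with an empty sub.
sub load_settings
{
    ++$runs;

    %settings = ( verbose => 1 );       # hypothetical payload

    no warnings 'redefine';

    my $ref = qualify_to_ref( 'load_settings' );

    *$ref = sub {};

    return
}

# Either sub may be the first one called; both can ask for
# the load without tracking whether it already happened.
sub is_verbose    { load_settings(); return $settings{ verbose } }
sub setting_names { load_settings(); return sort keys %settings }

my $verbose = is_verbose();
my @names   = setting_names();
```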

Cycling The Cache
● There are times when you want to purge and
reinitialize the cache.
● A trampoline object with populate and use subs that
flipflop can handle this easily.
● Reassigning a trampoline reinitializes the cache:

$cache
= Object::Trampoline->init_cache( $class => @argz )
if $age > $time_max;
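
The same flip-flop can be sketched with subroutines instead of an object. This is an illustration under assumptions, not the Object::Trampoline approach shown above: `lookup`, `purge`, and the cache contents are hypothetical, and the purge is triggered by hand rather than by an age check.

```perl
use strict;
use warnings;
use Symbol qw( qualify_to_ref );

my %cache = ();
my $loads = 0;
my $ref   = qualify_to_ref( 'lookup' );

my ( $populate, $use );

# Steady state: a plain hash lookup.
$use = sub { return $cache{ $_[0] } };

# Loader: fills the cache, installs $use, and redispatches.
$populate
= sub
{
    ++$loads;

    %cache = ( alpha => 1, beta => 2 );     # hypothetical reload

    no warnings 'redefine';

    *$ref = $use;

    goto &$use
};

# Purging puts the loader back, so the next lookup reloads.
sub purge
{
    %cache = ();

    no warnings 'redefine';

    *$ref = $populate;

    return
}

*$ref = $populate;

my $first  = lookup( 'alpha' );     # loads, then looks up
my $second = lookup( 'beta' );      # lookup only
purge();
my $third  = lookup( 'alpha' );     # loads again
```

In a long-running process `purge` would be called when the cache is judged stale, for example on the age test shown above.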

v5.10 Introduced “state” Variables
● Scoped like a lexical.
● Assigned once at runtime.
● Maintain value within a single lexical context
throughout the program.
● Assign the cache or assign a flag variable with the
side effect of populating the cache.
● Initially supported only scalars; since v5.28 state
arrays and hashes can be initialized as well.

Obvious case: Assign the cache.
● Assign the cache at runtime:
sub cache_mangler { state $cache = init_cache; … }
● $cache will be assigned at runtime, the first time
cache_mangler is called.
● The value will be retained between calls.
● Catch: $cache is only available within
cache_mangler, not outside of it.
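
A runnable sketch of this case, with a hypothetical `init_cache` that counts its own calls to show the assignment happens only once:

```perl
use v5.10;
use strict;
use warnings;

my $loads = 0;

# Hypothetical loader standing in for an expensive read.
sub init_cache
{
    ++$loads;

    return +{ red => '#f00', green => '#0f0' };
}

sub cache_mangler
{
    # Assigned on the first call only; retained afterwards.
    state $cache = init_cache();

    my $key = shift;

    return $cache->{ $key };
}

my @hex = map { cache_mangler( $_ ) } qw( red green red );
```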

Initialize a Shared Cache
● Subs may want to share a cache.
● $y and $z are assigned at most once per execution
when foo or bar are called.
● The sanity check in init only needs to be handled at
most once per subroutine.

use v5.10;

my %cache = ();

sub init
{
    %cache or %cache = …;
}

sub foo { state $y = init; … }
sub bar { state $z = init; … }

…

my $foo = foo 'bletch';
my $bar = bar 'blort';
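
A filled-in sketch of the shared-cache case, assuming a trivial payload in place of the elided initializer. The cache contents, `$loads`, and the `$ready` flags are hypothetical; the point is that the check in `init` runs once per subroutine and the load runs once overall.

```perl
use v5.10;
use strict;
use warnings;

my %cache = ();
my $loads = 0;

sub init
{
    # Populate only if nobody has done so already.
    %cache or do
    {
        ++$loads;
        %cache = ( bletch => 'foo', blort => 'bar' );
    };

    return 1;
}

# Each state flag is assigned on that sub's first call, so
# init is called at most once from foo and once from bar.
sub foo { state $ready = init(); return    $cache{ $_[0] } }
sub bar { state $ready = init(); return uc $cache{ $_[0] } }

my $foo   = foo( 'bletch' );
my $bar   = bar( 'blort' );
my $again = foo( 'blort' );
```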

Summary
● True laziness includes managing data.
● Preloading it all or testing it at each step are not lazy.
● Object::Trampoline provides one way.
● Trampoline subroutines offer another approach.
● v5.10 introduced state variables which provide a few
ways to initialize something once.
