Expressiveness, Simplicity and Users
Craig Chambers' ECOOP 2011 Keynote talk.

Presentation Transcript

  • Expressiveness, Simplicity, and Users
    Craig Chambers
    Google
  • A Brief Bio
    MIT: 82-86
    Argus, with Barbara Liskov, Bill Weihl, Mark Day
    Stanford: 86-91
    Self, with David Ungar, Urs Hölzle, …
    U. of Washington: 91-07
    Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ...
    Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, …
    Google: 07-
    Flume, …
  • Some Questions
    What makes an idea successful?
    Which ideas are adopted most?
    Which ideas have the most impact?
  • Outline
    Some past projects
    Self language, Self compiler
    Cecil language, Vortex compiler
    A current project
    Flume: data-parallel programming system
  • Self Language [Ungar & Smith 87]
    Purified essence of Smalltalk-like languages
    all data are objects
    no classes
    all actions are messages
    field accesses, control structures
    Core ideas are very simple
    widely cited and understood
  • Self v2 [Chambers, Ungar, Chang 91]
    Added encapsulation and privacy
    Added prioritized multiple inheritance
    supported both ordered and unordered mult. inh.
    Sophisticated, or complicated?
    Unified, or kitchen sink?
    Not adopted; dropped from Self v3
  • Self Compiler [Chambers, Ungar 89-91]
    Dynamic optimizer (an early JIT compiler)
    Customization: specialize code for each receiver class
    Class/type dataflow analysis; lots of inlining
    Lazy compilation of uncommon code paths
    89: customization + simple analysis: effective
    90: + complicated analysis: more effective but slow
    91: + lazy compilation: still more effective, and fast
    [Hölzle, … 92-94]: + dynamic type feedback: zowie!
    Simple analysis + type feedback widely adopted
  • Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99]
    Pure objects, pure messages
    Multimethods, static typechecking
    encapsulation
    modules, modular typechecking
    constraint-based polymorphic type system
    integrates F-bounded poly. and “where” clauses
    later: MultiJava, EML [Lee], Diesel, …
    Work on multimethods, “open classes” is well-known
    Multimethods not widely available 
  • Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01]
    Whole-program optimizer, for Cecil, Java, …
    Class hierarchy analysis
    Profile-guided class/type feedback
    Dataflow analysis, code specialization
    Interprocedural static class/type analysis
    Fast context-insensitive [Defouw], context-sensitive
    Incremental recompilation; composable dataflow analyses
    Project well-known
    CHA: my most cited paper; a very simple idea
    More-sophisticated work less widely adopted
  • Some Other Work
    DyC [Grant, Philipose, Mock, Eggers 96-00]
    Dynamic compilation for C
    ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …]
    PL support for software architecture
    Cobalt, Rhodium [Lerner, Millstein 02-05 …]
    Provably correct compiler optimizations
  • Trends
    Simpler ideas easier to adopt
    Sophisticated ideas need a simple story to be impactful
    Ideal: “deceptively simple”
    Unification != Swiss Army Knife
    Language papers have had more citations; compiler work has had more practical impact
    The combination can work well
  • A Current Project: Flume [Chambers, Raniwala, Perry, ... 10]
    Make data-parallel MapReduce-like pipelines easy to write,
    yet efficient to run
  • Data-Parallel Programming
    Analyze & transform large, homogeneous data sets, processing separate elements in parallel
    Web pages
    Click logs
    Purchase records
    Geographical data sets
    Census data

    Ideal: “embarrassingly parallel” analysis of petabytes of data
  • Challenges
    Parallel distributed programming is hard
    To do:
    Assign machines
    Distribute program binaries
    Partition input data across machines
    Synchronize jobs, communicate data when needed
    Monitor jobs
    Deal with faults in programs, machines, network, …
    Tune: stragglers, work stealing, …
    What if user is a domain expert, not a systems/PL expert?
  • MapReduce [Dean & Ghemawat, 04]
    (Diagram: two example pipelines through the map, shuffle, and reduce phases.
    Purchases: map item -> co-item; shuffle to item -> all co-items; reduce to item -> recommend.
    Queries: map term -> hour+city; shuffle to term -> (hour+city)*; reduce to term -> what's hot, when.)
  • MapReduce
    Greatly eases writing fault-tolerant data-parallel programs
    Handles many tedious and/or tricky details
    Has excellent (batch) performance
    Offers a simple programming model
    Lots of knobs for tuning
    Pipelines of MapReduces?
    Additional details to handle
    temp files
    pipeline control
    Programming model becomes low-level
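
To make "low-level" concrete, here is a rough sketch of the glue code a hand-built pipeline of MapReduces tends to require; every name here (runMapReduce, the mapper and reducer classes, the temp paths) is hypothetical and for illustration only:

    // Hypothetical sketch: chaining MapReduces by hand means managing
    // temp files and pipeline control yourself.
    String tmp1 = "/tmp/stage1";
    runMapReduce(new ExtractWordsMapper(), new CountReducer(),
                 "/gfs/corpus/*.txt", tmp1);            // stage 1: word counts
    String tmp2 = "/tmp/stage2";
    runMapReduce(new TopWordsMapper(), new TopWordsReducer(),
                 tmp1, tmp2);                           // stage 2: top words
    formatAndWrite(tmp2, "cnts.txt");                   // stage 3: formatting
    deleteTempFiles(tmp1, tmp2);                        // manual cleanup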
  • Flume
    Ease task of writing data-parallel pipelines
    Offer high-level data-parallel abstractions, as a Java or C++ library
    Classes for (possibly huge) immutable collections
    Methods for data-parallel operations
    Easily composed to form pipelines
    Entire pipeline in a single program
    Automatically optimize and execute pipeline, e.g., via a series of MapReduces
    Manage lower-level details automatically
  • Flume Classes and Methods
    Core data-parallel collection classes:
    PCollection<T>, PTable<K,V>
    Core data-parallel methods:
    parallelDo(DoFn)
    groupByKey()
    combineValues(CombineFn)
    flatten(...)
    read(Source), writeTo(Sink), …
    Derive other methods from these primitives:
    join(...), count(), top(CompareFn,N), ...
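
As a sketch of that layering, a derived operation like count() can be assembled from the three primitives roughly as follows; PairWithOneFn and SumLongsFn are hypothetical helper classes, not actual FlumeJava names, and the types are simplified (the real API converts between PCollection and PTable):

    // count() from primitives: pair each element with 1, group by element,
    // then sum each group's 1s into a single total.
    static PTable<String, Long> count(PCollection<String> words) {
      return words
          .parallelDo(new PairWithOneFn())     // word -> (word, 1L)
          .groupByKey()                        // word -> [1L, 1L, ...]
          .combineValues(new SumLongsFn());    // word -> total
    }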
  • Example: TopWords
    PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt"));
    PCollection<String> words = lines.parallelDo(new ExtractWordsFn());
    PTable<String, Long> wordCounts = words.count();
    PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000);
    PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn());
    formattedOutput.writeTo(TextIO.sink("cnts.txt"));
    FlumeJava.run();
  • Example: TopWords
    read(TextIO.source("/gfs/corpus/*.txt"))
        .parallelDo(new ExtractWordsFn())
        .count()
        .top(new OrderCountsFn(), 1000)
        .parallelDo(new FormatCountFn())
        .writeTo(TextIO.sink("cnts.txt"));
    FlumeJava.run();
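
The parallelDo calls above take user-written "do functions". A minimal sketch of ExtractWordsFn follows, assuming DoFn/EmitFn shapes in the spirit of the FlumeJava paper; the exact signatures are assumptions, and the tokenizer is deliberately crude:

    // Assumed interface shapes, for illustration only.
    interface EmitFn<T> { void emit(T value); }
    abstract class DoFn<I, O> {
      public abstract void process(I input, EmitFn<O> emitFn);
    }

    // Emits each word of an input line as a separate output element.
    class ExtractWordsFn extends DoFn<String, String> {
      @Override
      public void process(String line, EmitFn<String> emitFn) {
        for (String word : line.split("\\W+")) {   // crude word splitter
          if (!word.isEmpty()) {
            emitFn.emit(word);
          }
        }
      }
    }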
  • Execution Graph
    Data-parallel primitives (e.g., parallelDo) are “lazy”
    Don’t actually run right away, but wait until demanded
    Calls to primitives build an execution graph
    Nodes are operations to be performed
    Edges are PCollections that will hold the results
    An unevaluated result PCollection is a “future”
    Points to the graph that computes it
    Derived operations (e.g., count, user code) call lazy primitives and so get inlined away
    Evaluation is “demanded” by FlumeJava.run()
    Optimizes, then executes
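
A minimal sketch of this "future" idea, assuming nothing about Flume's real internals: each deferred collection just records the operation that produces it, so API calls grow a graph instead of processing data (DoFn as sketched earlier):

    // Illustrative only: a deferred collection is a future over the graph.
    abstract class Operation {}                  // a node in the execution graph
    class ParallelDoOp extends Operation {
      final Operation input;                     // edge to the producing node
      final DoFn<?, ?> fn;
      ParallelDoOp(Operation input, DoFn<?, ?> fn) {
        this.input = input;
        this.fn = fn;
      }
    }

    class DeferredCollection<T> {
      final Operation producer;                  // graph that computes this result
      DeferredCollection(Operation producer) { this.producer = producer; }

      <S> DeferredCollection<S> parallelDo(DoFn<T, S> fn) {
        // No data moves here; we only extend the graph. FlumeJava.run()
        // later walks such graphs, optimizes them, and executes them.
        return new DeferredCollection<>(new ParallelDoOp(producer, fn));
      }
    }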
  • Execution Graph
    (Diagram: the TopWords calls build this graph:
    read(TextIO.source("/…/*.txt")) -> pDo from parallelDo(new ExtractWordsFn())
    -> pDo + gbk + cv from count()
    -> pDo + gbk + pDo from top(new OrderCountsFn(), 1000)
    -> pDo from parallelDo(new FormatCountFn())
    -> write from writeTo(TextIO.sink("cnts.txt")).)
  • Optimizer
    Fuse trees of parallelDo operations into one
    Producer-consumer,co-consumers (“siblings”)
    Eliminate now-unused intermediate PCollections
    Form MapReduces
    pDo + gbk + cv + pDo MapShuffleCombineReduce (MSCR)
    General: multi-mapper, multi-reducer, multi-output
    pDo
    pDo
    pDo
    pDo
    pDo
    pDo
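
The heart of producer-consumer fusion is just function composition over the DoFn/EmitFn shapes sketched earlier: the fused function hands each intermediate value straight to the consumer, so the intermediate PCollection never materializes. (Illustrative only; the real optimizer fuses whole trees, including siblings.)

    // Fuses producer 'first' and consumer 'second' into one DoFn.
    class ComposedFn<A, B, C> extends DoFn<A, C> {
      private final DoFn<A, B> first;
      private final DoFn<B, C> second;
      ComposedFn(DoFn<A, B> first, DoFn<B, C> second) {
        this.first = first;
        this.second = second;
      }
      @Override
      public void process(A input, EmitFn<C> emitFn) {
        // Each value emitted by 'first' flows directly into 'second'.
        first.process(input, b -> second.process(b, emitFn));
      }
    }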
  • Final Pipeline
    (Diagram: fusion collapses the TopWords graph from 8 operations to 2:
    read -> mscr covering parallelDo(new ExtractWordsFn()) and count()
    -> mscr covering top(new OrderCountsFn(), 1000) and parallelDo(new FormatCountFn())
    -> write.)
  • Executor
    Runs each optimized MSCR
    If small data, runs locally, sequentially
    develop and test in normal IDE
    If large data, runs remotely, in parallel
    Handles creating, deleting temp files
    Supports fast re-execution of incomplete runs
    Caches, reuses partial pipeline results
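
The size-based dispatch might look roughly like this; every name and the threshold here are hypothetical:

    // Hypothetical sketch of the executor's per-operation choice.
    void execute(MscrOp op) {
      if (estimatedInputBytes(op) < LOCAL_EXECUTION_THRESHOLD) {
        runLocalSequential(op);     // small data: test and debug in a normal IDE
      } else {
        runRemoteMapReduce(op);     // large data: distributed parallel execution
      }
    }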
  • Another Example: SiteData
    (Diagram: the SiteData execution graph: parallelDo operations applying GetPScoreFn,
    GetVerticalFn, GetDocInfoFn, PickBestFn, and MakeDocTraitsFn, combined with gbk, cv,
    and join() stages.)
  • Another Example: SiteData
    (Diagram: after fusion, the same SiteData graph runs as 2 MSCRs: 11 ops -> 2 ops.)
  • Experience
    FlumeJava released to Google users in May 2009
    Now: hundreds of pipelines run by hundreds of users every month
    Real pipelines process anywhere from megabytes to petabytes
    Users find FlumeJava a lot easier than MapReduce
    Advanced users can exert control over optimizer and executor if/when necessary
    But when things go wrong, lower abstraction levels intrude
  • How Well Does It Work?
    How does FlumeJava compare in speed to:
    an equally modular Java MapReduce pipeline?
    a hand-optimized Java MapReduce pipeline?
    a hand-optimized Sawzall pipeline?
    Sawzall: language for logs processing
    How big are pipelines in practice?
    How much does the optimizer help?
  • Performance
  • Optimizer Impact
  • Current and Future Work
    FlumeC++ just released to Google users
    Auto-tuner
    Profile executions, choose good settings for tuning MapReduces
    Other execution substrates than MapReduce
    Continuous/streaming execution?
    Dynamic code generation and optimization?
  • A More Advanced Approach
    Apply advanced PL ideas to the data-parallel domain
    A custom language tuned to this domain
    A sophisticated static optimizer and code generator
    An integrated parallel run-time system
  • Lumberjack
    A language designed for data-parallel programming
    An implicitly parallel model
    All collections potentially PCollections
    All loops potentially parallel
    Functional
    Mostly side-effect free
    Concise lambdas
    Advanced type system to minimize verbosity
  • Static Optimizer
    Decide which collections are PCollections, which loops are parallel loops
    Interprocedural context-sensitive analysis
    OO type analysis
    side-effect analysis
    inlining
    dead assignment elimination

  • Parallel Run-Time System
    Similar to Flume’s run-time system
    Schedules MapReduces
    Manages temp files
    Handles faults
  • Result: Not Successful
    A new language is a hard sell to most developers
    Language details obscure key new concepts
    Hard to be proficient in yet another language with yet another syntax
    Libraries?
    Increases risk to their projects
    Optimizer constrained by limits of static analysis
  • Response: FlumeJava
    Replace custom language with Java + Flume library
    More verbose syntactically
    Flume abstractions highlighted
    All standard libraries & coding idioms preserved
    Much less risk
    Easy to try out, easy to like, easy to adopt
    Dynamic optimizer less constrained than static optimizer
    Reuse parallel run-time system
    Sophistication and novelty can hinder adoption
  • Some Related Systems
    Hadoop, Cascading
    C#/LINQ, Dryad
    Pig, PigLatin
    streaming languages (e.g., StreamIt, Brook)
    database query optimizers
  • Conclusions
    Simpler ideas easier to adopt
    By researchers and by users
    Sophisticated ideas still needed,to support simple interfaces
    Doing things dynamically instead of statically can be liberating
  • Thanks!