Expressiveness, Simplicity, and Users Craig Chambers Google
A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, Urs Hölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
Self Language [Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
Self v2 [Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered mult. inh. Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
Self Compiler [Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded poly. and “where” clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, “open classes” is well-known Multimethods not widely available
Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: “deceptively simple” Unification != Swiss Army Knife Language papers have had more citations; compiler work has had more practical impact The combination can work well
A Current Project: Flume [Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelines easy to write yet efficient to run
Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: “embarrassingly parallel” analysis of petabytes of data
Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
MapReduce [Dean & Ghemawat, 04]
stage    | purchases             | queries
map      | item -> co-item       | term -> hour+city
shuffle  | item -> all co-items  | term -> (hour+city)*
reduce   | item -> recommend     | term -> what’s hot, when
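The map, shuffle, and reduce stages named above can be sketched in a single process for the purchases column. This is only a toy illustration, not the MapReduce API: the class name MiniMapReduce, the basket data, and the "most frequent co-item" recommendation rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy, single-process illustration of map -> shuffle -> reduce (hypothetical names). */
final class MiniMapReduce {

    /** Map: for each purchase basket, emit (item, co-item) for every ordered pair of distinct items. */
    static List<Map.Entry<String, String>> map(List<List<String>> baskets) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (List<String> basket : baskets) {
            for (String item : basket) {
                for (String other : basket) {
                    if (!item.equals(other)) {
                        out.add(Map.entry(item, other));
                    }
                }
            }
        }
        return out;
    }

    /** Shuffle: group all values that share a key (item -> all co-items). */
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    /** Reduce: for each item, recommend its most frequent co-item (ties go to the alphabetically first). */
    static Map<String, String> reduce(Map<String, List<String>> grouped) {
        Map<String, String> recommendations = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String coItem : e.getValue()) {
                counts.merge(coItem, 1, Integer::sum);
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> c : counts.entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            recommendations.put(e.getKey(), best);
        }
        return recommendations;
    }

    public static void main(String[] args) {
        // Made-up purchase baskets, for illustration only.
        List<List<String>> baskets = List.of(
            List.of("tent", "stove"),
            List.of("tent", "stove", "lamp"),
            List.of("lamp", "stove"));
        System.out.println(reduce(shuffle(map(baskets))));
    }
}
```

In a real deployment each stage runs across many machines and the shuffle moves data over the network; the sketch only shows the shape of the data at each step.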
MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions, as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline, e.g., via a series of MapReduces Manage lower-level details automatically
Flume Classes and Methods Core data-parallel collection classes: PCollection<T>,  PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(),  top(CompareFn,N), ...
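As a rough illustration of deriving a method from the primitives, the sketch below builds count() out of in-memory stand-ins for parallelDo, groupByKey, and combineValues. The stand-ins operate on ordinary java.util collections and are assumptions made for this example; they are not the Flume classes, and the real library's signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** In-memory stand-ins for three primitives, plus count() derived from them (hypothetical names). */
final class DerivedOpsSketch {

    /** parallelDo stand-in: apply fn to every element independently. */
    static <T, R> List<R> parallelDo(List<T> in, Function<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T t : in) {
            out.add(fn.apply(t));
        }
        return out;
    }

    /** groupByKey stand-in: collect all values that share a key. */
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** combineValues stand-in: fold each key's (non-empty) value list with a combiner. */
    static <K extends Comparable<K>, V> Map<K, V> combineValues(
            Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().stream().reduce(combiner).get());
        }
        return out;
    }

    /** count(), derived: tag each element with 1, group by element, then sum the 1s. */
    static <T extends Comparable<T>> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones = parallelDo(in, t -> Map.entry(t, 1L));
        return combineValues(groupByKey(ones), Long::sum);
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("b", "a", "b")));
    }
}
```

Because count() is written purely in terms of the three primitives, any optimization that applies to them applies to it as well.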
Example: TopWords
PCollection<String> lines =
    read(TextIO.source("/gfs/corpus/*.txt"));
PCollection<String> words =
    lines.parallelDo(new ExtractWordsFn());
PTable<String, Long> wordCounts =
    words.count();
PCollection<Pair<String, Long>> topWords =
    wordCounts.top(new OrderCountsFn(), 1000);
PCollection<String> formattedOutput =
    topWords.parallelDo(new FormatCountFn());
formattedOutput.writeTo(TextIO.sink("cnts.txt"));
FlumeJava.run();
Example: TopWords
read(TextIO.source("/gfs/corpus/*.txt"))
    .parallelDo(new ExtractWordsFn())
    .count()
    .top(new OrderCountsFn(), 1000)
    .parallelDo(new FormatCountFn())
    .writeTo(TextIO.sink("cnts.txt"));
FlumeJava.run();
Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
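The deferred-evaluation idea can be sketched with a tiny "future" collection. This is a minimal model under stated assumptions: Lazy, source, and materialize are hypothetical names invented for the example, and only the parallelDo name is borrowed from the slides. Calling parallelDo merely records a node; the user function runs only when evaluation is demanded.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy model of deferred evaluation. All names here are hypothetical, not the FlumeJava API. */
final class LazyGraphSketch {

    /** A "future" collection: it knows how to compute its elements but has not done so yet. */
    abstract static class Lazy<T> {
        private List<T> cached;

        abstract List<T> compute();   // runs this node's operation
        abstract String describe();   // returns the graph path that leads to this node

        /** Lazy primitive: only adds a node to the graph; fn is not called here. */
        <R> Lazy<R> parallelDo(String name, Function<T, R> fn) {
            Lazy<T> parent = this;
            return new Lazy<R>() {
                @Override List<R> compute() {
                    return parent.materialize().stream().map(fn).collect(Collectors.toList());
                }
                @Override String describe() {
                    return parent.describe() + " -> " + name;
                }
            };
        }

        /** Demand evaluation: compute this node (and, transitively, its inputs) once. */
        final List<T> materialize() {
            if (cached == null) {
                cached = compute();
            }
            return cached;
        }

        /** A source node that simply holds its input data. */
        static <T> Lazy<T> source(String name, List<T> data) {
            return new Lazy<T>() {
                @Override List<T> compute() { return data; }
                @Override String describe() { return name; }
            };
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Lazy<String> words = Lazy.source("read", List.of("a", "bb", "ccc"));
        Lazy<Integer> lengths = words.parallelDo("length", w -> {
            calls.incrementAndGet();
            return w.length();
        });
        // The graph exists, but the user function has not run yet.
        System.out.println(lengths.describe() + ", calls so far: " + calls.get());
        // Demanding the result triggers the actual work.
        System.out.println(lengths.materialize() + ", calls after run: " + calls.get());
    }
}
```

Having the whole graph in hand before anything executes is what gives an optimizer room to rearrange it first.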
Execution Graph [Diagram: the TopWords calls read(...), parallelDo(new ExtractWordsFn()), count(), top(new OrderCountsFn(), 1000), parallelDo(new FormatCountFn()), and writeTo(...) expand into a graph of primitive nodes: read, pDo, gbk, cv, and write.]
Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers (“siblings”) Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo => MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output
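Producer-consumer fusion can be illustrated on a linear chain of stages: adjacent parallelDo stages collapse into one stage whose function is the composition of the originals. The sketch below is a simplified model, not the Flume optimizer; Op, pDo, barrier, and fuse are hypothetical names, elements are restricted to strings, and sibling fusion and MSCR formation are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy producer-consumer fusion over a linear pipeline (hypothetical names). */
final class FusionSketch {

    /** One pipeline stage: a per-element function (pDo) or a barrier such as gbk. */
    static final class Op {
        final String name;
        final Function<String, String> fn;   // null marks a non-fusable barrier

        private Op(String name, Function<String, String> fn) {
            this.name = name;
            this.fn = fn;
        }

        static Op pDo(String name, Function<String, String> fn) { return new Op(name, fn); }
        static Op barrier(String name) { return new Op(name, null); }
        boolean isParallelDo() { return fn != null; }
    }

    /** Merge each run of adjacent pDo stages into a single pDo by composing their functions. */
    static List<Op> fuse(List<Op> pipeline) {
        List<Op> fused = new ArrayList<>();
        for (Op op : pipeline) {
            int last = fused.size() - 1;
            if (op.isParallelDo() && last >= 0 && fused.get(last).isParallelDo()) {
                Op prev = fused.get(last);
                // The intermediate collection between prev and op is no longer needed.
                fused.set(last, Op.pDo(prev.name + "+" + op.name, prev.fn.andThen(op.fn)));
            } else {
                fused.add(op);
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        List<Op> fused = fuse(List.of(
            Op.pDo("trim", String::trim),
            Op.pDo("lower", String::toLowerCase),
            Op.barrier("gbk"),
            Op.pDo("exclaim", s -> s + "!")));
        for (Op op : fused) {
            System.out.println(op.name);
        }
    }
}
```

The barrier stage stops fusion because a grouping step needs all of its input before it can produce output, whereas per-element functions can be chained freely.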
Final Pipeline Fusion [Diagram: the TopWords execution graph after optimization, fused from 8 operations into 2 operations (two MSCRs).]
Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
Another Example: SiteData [Diagram: execution graph of pDo, gbk, cv, and join() nodes, with user functions GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, and MakeDocTraitsFn.]
Another Example: SiteData [Diagram: the same graph after optimization, fused from 11 ops into 2 ops (two MSCRs).]
Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process megabytes <=> petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
Performance
Optimizer Impact
Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
Response: FlumeJava Replace custom language with Java + Flume library More verbose syntactically
All standard libraries & coding idioms preserved
Much less risk
Easy to try out, easy to like, easy to adopt

Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 

Recently uploaded (20)

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 

Expressiveness, Simplicity and Users

  • 1. Expressiveness, Simplicity, and Users Craig Chambers Google
  • 2. A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, Urs Hölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
  • 3. Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
  • 4. Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
  • 5. Self Language [Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
  • 6. Self v2 [Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered mult. inh. Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
  • 7. Self Compiler [Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
  • 8. Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded poly. and “where” clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, “open classes” is well-known Multimethods not widely available
  • 9. Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
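The class hierarchy analysis (CHA) named on the Vortex slide can be illustrated with a minimal sketch: if only one implementation of a method is reachable from a receiver's static class and all of its subclasses, the call can be bound statically. The `ClassHierarchy` class, its method names, and the Shape example are hypothetical and are not taken from Vortex; the sketch also assumes the whole program is visible, as the slide's "whole-program optimizer" suggests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model of a program's class hierarchy (not Vortex's data structures).
class ClassHierarchy {
    private final Map<String, String> superclassOf = new HashMap<>();
    private final Map<String, List<String>> subclassesOf = new HashMap<>();
    private final Map<String, Set<String>> methodsDefinedIn = new HashMap<>();

    // Register a class, its superclass (null for a root), and the methods it defines itself.
    void addClass(String name, String superclass, String... definedMethods) {
        if (superclass != null) {
            superclassOf.put(name, superclass);
            subclassesOf.computeIfAbsent(superclass, k -> new ArrayList<>()).add(name);
        }
        methodsDefinedIn.put(name, new HashSet<>(List.of(definedMethods)));
    }

    // Walk up from cls to the class whose definition of the method cls would inherit.
    private String implementationFor(String cls, String method) {
        for (String c = cls; c != null; c = superclassOf.get(c)) {
            if (methodsDefinedIn.getOrDefault(c, Set.of()).contains(method)) {
                return c;
            }
        }
        return null;
    }

    // Collect the implementation that cls and each of its subclasses would run.
    private void collectTargets(String cls, String method, Set<String> targets) {
        String impl = implementationFor(cls, method);
        if (impl != null) {
            targets.add(impl);
        }
        for (String sub : subclassesOf.getOrDefault(cls, List.of())) {
            collectTargets(sub, method, targets);
        }
    }

    // If exactly one implementation is reachable, the call site can be statically bound.
    Optional<String> uniqueTarget(String staticReceiverClass, String method) {
        Set<String> targets = new HashSet<>();
        collectTargets(staticReceiverClass, method, targets);
        return targets.size() == 1 ? Optional.of(targets.iterator().next()) : Optional.empty();
    }

    // Illustrative hierarchy: Shape defines area and name; Circle and Square override area.
    static ClassHierarchy example() {
        ClassHierarchy h = new ClassHierarchy();
        h.addClass("Shape", null, "area", "name");
        h.addClass("Circle", "Shape", "area");
        h.addClass("Square", "Shape", "area");
        return h;
    }
}
```

In this example a call to `name` on a `Shape` receiver has a single possible target and could be inlined, while a call to `area` on a `Shape` receiver still needs dynamic dispatch.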
  • 10. Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
  • 11. Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: “deceptively simple” Unification != Swiss Army Knife Language papers have had more citations; compiler work has had more practical impact The combination can work well
  • 12. A Current Project: Flume [Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelines easy to write yet efficient to run
  • 13. Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: “embarrassingly parallel” analysis of petabytes of data
  • 14. Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
  • 15. MapReduce [Dean & Ghemawat, 04] [Figure: two example pipelines, purchases and queries. map: item -> co-item; term -> hour+city. shuffle: item -> all co-items; term -> (hour+city)*. reduce: item -> recommend; term -> what’s hot, when]
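The map, shuffle, and reduce phases on the slide above can be sketched as a small single-machine Java program. This is only an in-memory illustration of the programming model, not the distributed MapReduce system; `MiniMapReduce`, `run`, and `coItems` are hypothetical names, and the basket data is invented to mirror the slide's "purchases" example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory model of the three phases; no distribution or fault tolerance.
class MiniMapReduce {
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // map + shuffle: every emitted (key, value) pair is grouped under its key.
        Map<K, List<V>> shuffled = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        // reduce: each key's group of values is folded into one result.
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : shuffled.entrySet()) {
            results.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return results;
    }

    // "purchases" example: map item -> co-item, shuffle item -> all co-items,
    // then reduce to a sorted, de-duplicated list (a stand-in for "recommend").
    static Map<String, List<String>> coItems(List<List<String>> baskets) {
        return run(
            baskets,
            (List<String> basket) -> {
                List<Map.Entry<String, String>> pairs = new ArrayList<>();
                for (String item : basket) {
                    for (String other : basket) {
                        if (!item.equals(other)) {
                            pairs.add(Map.entry(item, other));
                        }
                    }
                }
                return pairs;
            },
            (String item, List<String> others) ->
                others.stream().distinct().sorted().collect(Collectors.toList()));
    }
}
```

The user supplies only the mapper and reducer; everything between them (grouping by key here, plus machine assignment and fault handling in the real system) belongs to the framework.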
  • 16. MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
  • 17. Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions, as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline, e.g., via a series of MapReduces Manage lower-level details automatically
  • 18. Flume Classes and Methods Core data-parallel collection classes: PCollection<T>, PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(), top(CompareFn,N), ...
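To illustrate how a derived method such as count() can be built from the primitives listed above, here is a minimal sketch using eager, in-memory stand-ins. The method names mirror the slide, but the signatures are assumptions and this is not the FlumeJava API: in particular, this `parallelDo` maps each element to exactly one output and works on `java.util.List` rather than `PCollection`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Eager, in-memory stand-ins for the primitives (hypothetical signatures).
class LocalPrimitives {
    // Apply fn to every element independently.
    static <T, S> List<S> parallelDo(List<T> in, Function<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T t : in) {
            out.add(fn.apply(t));
        }
        return out;
    }

    // Gather all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> in) {
        Map<K, List<V>> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : in) {
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return out;
    }

    // Fold each key's values into a single value with an associative combiner.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> in, BinaryOperator<V> combiner) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> e : in.entrySet()) {
            // groupByKey never produces an empty group, so reduce always yields a value here.
            out.put(e.getKey(), e.getValue().stream().reduce(combiner).get());
        }
        return out;
    }

    // count() derived from the three primitives: pair each element with 1, group, then sum.
    static <T> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones = parallelDo(in, t -> Map.entry(t, 1L));
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

Because `count` is written purely in terms of the primitives, any optimization that applies to the primitives applies to it as well.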
  • 19. Example: TopWords PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt")); PCollection<String> words = lines.parallelDo(new ExtractWordsFn()); PTable<String, Long> wordCounts = words.count(); PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000); PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn()); formattedOutput.writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 20. Example: TopWords read(TextIO.source("/gfs/corpus/*.txt")) .parallelDo(new ExtractWordsFn()) .count() .top(new OrderCountsFn(), 1000) .parallelDo(new FormatCountFn()) .writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 21. Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
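The lazy behavior described above can be sketched with a small deferred-collection class: calling a primitive only records a graph node, and nothing runs until evaluation is demanded. `Deferred` and its members are hypothetical names for illustration; unlike the slide's global FlumeJava.run(), this sketch demands evaluation per node and performs no optimization step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical deferred collection: each instance is a graph node that knows how to compute itself.
class Deferred<T> {
    static int evaluations = 0;   // counts node evaluations, for demonstration only

    private final Supplier<List<T>> compute;
    private List<T> cached;       // stays null until evaluation is demanded

    private Deferred(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    // Source node wrapping data that already exists.
    static <T> Deferred<T> of(List<T> data) {
        return new Deferred<>(() -> data);
    }

    // Lazy primitive: records the operation and returns immediately without calling fn.
    <S> Deferred<S> parallelDo(Function<T, S> fn) {
        return new Deferred<>(() -> {
            List<S> out = new ArrayList<>();
            for (T t : this.run()) {   // pulls the upstream node only when demanded
                out.add(fn.apply(t));
            }
            return out;
        });
    }

    // Demands evaluation: computes upstream nodes first, then this one, caching the result.
    List<T> run() {
        if (cached == null) {
            cached = compute.get();
            evaluations++;
        }
        return cached;
    }
}
```

Because the whole graph exists before anything executes, a system built this way has the opportunity to rewrite the graph between construction and execution.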
  • 22. Execution Graph [Figure: execution graph for the TopWords pipeline; node sequence read, pDo, pDo, gbk, cv, pDo, gbk, pDo, pDo, write, annotated with the calls read(TextIO.source("/…/*.txt")), parallelDo(new ExtractWordsFn()), count(), top(new OrderCountsFn(), 1000), parallelDo(new FormatCountFn()), writeTo(TextIO.sink("cnts.txt"))]
  • 23. Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers (“siblings”) Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo -> MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output [Figure: diagram of pDo nodes]
  • 24. Final Pipeline [Figure: the TopWords execution graph after fusion into mscr nodes; 8 operations reduced to 2 operations]
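Producer-consumer fusion of parallelDo operations can be illustrated with function composition: two element-wise passes become one, so the intermediate collection is never built. This is a minimal sketch of the idea only; `FusionSketch` and its methods are hypothetical, the `parallelDo` here is an eager one-to-one map over a `java.util.List`, and sibling fusion and MSCR formation are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical demonstration of producer-consumer fusion on in-memory lists.
class FusionSketch {
    static int collectionsBuilt = 0;   // counts materialized collections, for demonstration only

    // Eager stand-in for parallelDo: one pass over the input, one output collection.
    static <T, S> List<S> parallelDo(List<T> in, Function<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T t : in) {
            out.add(fn.apply(t));
        }
        collectionsBuilt++;
        return out;
    }

    // Unfused: two passes, with an intermediate collection between producer and consumer.
    static <A, B, C> List<C> unfused(List<A> in, Function<A, B> f, Function<B, C> g) {
        return parallelDo(parallelDo(in, f), g);
    }

    // Fused: the two functions are composed, so a single pass produces the final result.
    static <A, B, C> List<C> fused(List<A> in, Function<A, B> f, Function<B, C> g) {
        return parallelDo(in, f.andThen(g));
    }
}
```

Both versions return the same elements; the fused one simply skips materializing the intermediate result, which is the saving the slide refers to as eliminating now-unused intermediate PCollections.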
  • 25. Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
  • 26. Another Example: SiteData [Figure: initial execution graph of pDo, gbk, and cv nodes, labeled with GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, join(), and MakeDocTraitsFn]
  • 27. Another Example: SiteData [Figure: the SiteData execution graph fused into mscr nodes; 11 ops reduced to 2 ops]
  • 28. Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process megabytes <=> petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
  • 29. How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
  • 32. Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
  • 33. A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
  • 34. Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
  • 35. Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
  • 36. Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
  • 37. Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
  • 39. All standard libraries & coding idioms preserved
  • 41. Easy to try out, easy to like, easy to adopt
  • 42. Dynamic optimizer less constrained than static optimizer
  • 45. Conclusions Simpler ideas easier to adopt By researchers and by users Sophisticated ideas still needed, to support simple interfaces Doing things dynamically instead of statically can be liberating