The Workflow Abstraction

Strata CA 2013 talk "The Workflow Abstraction" by Paco Nathan about Cascading and related open source projects for building Enterprise Data Workflows.

Presentation Transcript

• “The Workflow Abstraction”, Strata SC 2013-02-28. Paco Nathan, Concurrent, Inc., San Francisco, CA. @pacoid. Copyright @2013, Concurrent, Inc.
  Notes: Background: a dual background in quantitative methods and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
• The Workflow Abstraction, agenda: 1. Funnel; 2. Circa 2008; 3. Cascading; 4. Sample Code; 5. Workflows; 6. Abstraction; 7. Trendlines. [flow diagram: Document Collection → Scrub/Tokenize → HashJoin with Stop Word List (RHS) → GroupBy(token) → Count → Word Count]
  Notes: This talk is about the workflow abstraction: the business process of structuring data; the practices of building robust apps at scale; the open source projects for Enterprise Data Workflows. We’ll consider some theory, examples, best practices, and trendlines: what are the drivers that brought us here, and where is this work heading? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows, robust apps at scale, for Hadoop and beyond.
• Marketing Funnel – overview. In reference to Making Data Work... Almost every business uses a model similar to this, give or take a few steps: Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat. Customer leads go in at the top, those get refined through several stages, then results flow out the bottom.
  Notes: Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years: analytics for the marketing funnel.
• Marketing Funnel – clickstream. Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream: ad impressions, URL clicks, landing page views, new user registrations, session cookies, online purchases, social network activity (“Like”), etc. The stages map to events roughly as: Awareness ~ Impression, Interest ~ Click, Evaluation ~ Sign Up, Conversion ~ Purchase.
  Notes: Online advertising involves what we call “clickstream” data: lots of events in log files, i.e., lots of unstructured data.
• Marketing Funnel – metrics. A variety of clickstream metrics can be used as performance indicators at different stages of the funnel: CPM (cost per thousand) at Awareness, CTR (click-through rate) at Interest, CPA (cost per action) at Conversion, etc. Further down the funnel: behaviors, NPS, social graph, loyalty, win back, etc.
  Notes: The many different highly-nuanced metrics which apply are mind-boggling :)
• Marketing Funnel – example calculations:

    metric   cost     events   formula                   rate
    CPM      $4,000   10^6     $4,000 ÷ (10^6 ÷ 10^3)    $4.00
    CTR      -        3∙10^3   3∙10^3 ÷ 10^6             0.3%
    CPA      $4,000   20       $4,000 ÷ 20               $200

  Notes: Here are examples of the kinds of calculations performed...
• Marketing Funnel – predictive model. Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc. Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP. As an example, after crunching lots of logs, suppose that CPP = $200 and LTV = $2000; then ROI = ($2000 − $200) ∕ $200, for a 9x multiple.
  Notes: For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
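To make the arithmetic from the last two slides concrete, here is a minimal sketch in Java that computes the CPM, CTR, CPA, and ROI figures shown above. The constants are the hypothetical values from the slides, not real campaign data, and treating CPP as equal to CPA is an assumption made only to keep the sketch self-contained:

    // A minimal sketch of the funnel metrics from the slides.
    // All input values are the hypothetical examples shown above.
    public class FunnelMetrics {
      public static void main(String[] args) {
        double adSpend = 4000.0;      // campaign cost
        double impressions = 1e6;     // ad impressions
        double clicks = 3e3;          // URL clicks
        double actions = 20;          // conversions, e.g., purchases

        double cpm = adSpend / (impressions / 1e3);  // cost per thousand => $4.00
        double ctr = clicks / impressions;           // click-through rate => 0.003 (0.3%)
        double cpa = adSpend / actions;              // cost per action => $200

        double cpp = cpa;             // assume cost per paying user == CPA here
        double ltv = 2000.0;          // customer lifetime value (hypothetical)
        double roi = (ltv - cpp) / cpp;              // => 9.0, i.e., a 9x multiple

        System.out.printf("CPM=$%.2f  CTR=%.2f%%  CPA=$%.0f  ROI=%.0fx%n",
            cpm, ctr * 100, cpa, roi);
      }
    }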
• Marketing Funnel – example architecture. Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data. [architecture diagram: Customers → Web App → logs/Cache → Logs, with source, trap, and sink taps feeding a Data Workflow on a Hadoop cluster, which connects to Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Reporting, and Support]
  Notes: Here’s an example architecture of using clickstream metrics within an online business.
• Marketing Funnel – complexities. Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc. Campaigns target specific geo/demo, test alternate landing pages, and probably need to segment the customer base. These issues make clickstream data large and yet sparse. Other issues: seasonal variation, fluctuating currency exchange rates, distortions due to credit card fraud, diminishing returns, forecasting requirements.
  Notes: However, real life intercedes. In many businesses, this is a complicated model to calculate correctly: scrubs; many vendors, data sources, and different metrics to be aligned; lots of roll-ups; Bayesian point estimates; forecasts and dashboards; the social dimension makes this convoluted. Not simple.
• Marketing Funnel – very large scale. Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend. Ad networks attempt to simplify and optimize parts of the funnel process as a value-add. The need for these insights has been a driver for Hadoop-related technologies.
  Notes: The needs for large-scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
• Marketing Funnel – very large scale (continued). Funnel modeling and optimization requires complex data workflows to obtain the required insights.
  Notes: These needs imply complex data workflows. It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.
• The Workflow Abstraction, agenda recap; next: 2. Circa 2008.
  Notes: A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
• Circa 2008 – Hadoop at scale. Scenario: Analytics team at a large ad network. The company had invested $MM capex in a large data warehouse across LOBs. A mission-critical app had been written as a large SQL workflow in the DW. Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers: billions of calculations daily. Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance. [diagram: clickstream → query/load → RDBMS → roll-ups, collab filter, per-user recommends]
  Notes: Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network. Most of the revenue depended on one app, written in a DW: monolithic SQL which nobody at the company understood.
• Circa 2008 – Hadoop at scale. Issues: the critical app had hit hard limits for scalability; several TB of data, 100s of servers; batch window length vs. failure rate vs. SLA, in the context of business growth, posed an existential risk. We built out a team to address these issues as rapidly as possible. We needed to re-create that data workflow based on Enterprise requirements.
  Notes: Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
• Circa 2008 – Hadoop at scale. Approach: reverse-engineered the business process from ~1500 lines of undocumented SQL; created a large, multi-step Apache Hadoop app on AWS; leveraged a cloud strategy to trade $MM capex for lower, scalable opex. Amazon identified our app as one of the largest Hadoop deployments on EC2, and our app became a case study for AWS prior to the Elastic MapReduce launch. [diagram: clickstream → msg queue → HDFS → roll-ups, collab filter, per-user recommends; query/load → RDBMS]
  Notes: Our solution involved dependencies among more than a dozen Hadoop job steps.
• Circa 2008 – Hadoop at scale. Unresolved: ETL was still a separate app; it was difficult to handle exceptions, notifications, debugging, etc., across the entire workflow; data scientists wore beepers since Ops lacked visibility into the business process; coding directly in MapReduce created a staffing bottleneck.
  Notes: This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM, for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows: a staffing bottleneck unless there’s a good abstraction layer; operational complexity, mostly due to lack of transparency; system integration problems *are* the main problem to solve.
• Circa 2008 – Hadoop at scale (continued). A good solution for a large, commercial Apache Hadoop deployment, but the app’s workflow management lacked crucial features... which led to a search for a better workflow abstraction.
  Notes: While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
• The Workflow Abstraction, agenda recap; next: 3. Cascading.
  Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
• Cascading – origins. API author Chris Wensel worked as a system architect at an Enterprise firm well known for several popular data products. Wensel was following the Nutch open source project, before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop: a potential blocker for leveraging this new open source technology.
  Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
• Cascading – functional programming. Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows: it leverages the JVM and Java-based tools without any need to create an entirely new language, and it allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters.
  Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: the difficulty of retraining staff, and the scarcity of Hadoop experts.
• quotes... “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset ... Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” (CIO, Thor Olavsrud, 2012-06-06, cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading)
  “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics ... A vast improvement over native MapReduce functions or Pig UDFs.” (2012 BOSSIE Awards, James Borck, 2012-09-18, infoworld.com/slideshow/65089)
  Notes: Industry analysts are picking up on the staffing costs related to Hadoop: “no free lunch”. The issues: staffing bottleneck; operational complexity; system integration.
• Cascading – deployments. Case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc. Partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera. 5+ year history of Enterprise production deployments, ASL 2 license, source on GitHub, http://conjars.org. Use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
  Notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms in OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
• examples... Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading, used for their large-scale production deployments. New case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), github.com/nathanmarz/cascalog/wiki; Scalding in Scala (2012), github.com/twitter/scalding/wiki.
  Notes: Many case studies, many Enterprise production deployments, now for 5+ years.
• examples (continued)... Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals.
  Notes: Cascading as a basis for workflow abstraction, for Enterprise data workflows.
• The Workflow Abstraction, agenda recap; next: 4. Sample Code.
  Notes: Code samples in Cascading / Cascalog / Scalding, based on Word Count.
• The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it: requires a minimal amount of code; demonstrates use of both symbolic and numeric values; shows a dependency graph of tuples as an abstraction; is not many steps away from useful search indexing; serves as a “Hello World” for Hadoop apps. Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems. In pseudocode:

    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;
      for each pc in group:
        count += Int(pc);
      emit(word, String(count));

  Notes: Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
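For comparison with the Cascading listing a few slides below (which the speaker notes describe as roughly one third the size of the raw version), here is the stock Word Count written directly against the Hadoop MapReduce API, essentially as it appears in the Apache Hadoop documentation; it is included here for scale, not as code from the talk:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // map: emit (token, 1) for each token in the input line
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // reduce: sum the counts for each token
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }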
• word count – conceptual flow diagram. [flow diagram: Document Collection → Tokenize → GroupBy(token) → Count → Word Count; 1 map, 1 reduce, 18 lines of code] cascading.org/category/impatient, gist.github.com/3900702
  Notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
• word count – Cascading app in Java:

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();

  Notes: Based on a Cascading implementation of Word Count, here is sample code, approx 1/3 the code size of the Word Count example from Apache Hadoop. The second-to-last line generates a DOT file for the flow diagram. (The tab delimiters and regex escapes lost in the page extraction have been restored here.)
• word count – generated flow diagram. [DOT flow plan: [head] → Hfs[TextDelimited[[doc_id, text]]]["data/rain.txt"] → map: Each(token)[RegexSplitGenerator[decl:token]] → GroupBy(wc)[by:[token]] → reduce: Every(wc)[Count[decl:count]] → Hfs[TextDelimited[[token, count]]]["output/wc"] → [tail]]
  Notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan, generated by the app itself.
• word count – Cascalog / Clojure:

    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\](),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam
    ; github.com/Quantisan/Impatient

  Notes: Here is the same Word Count app written in Clojure, using Cascalog. (The regex backslashes lost in the page extraction have been restored here.)
• word count – Cascalog / Clojure. github.com/nathanmarz/cascalog/wiki
  - implements Datalog in Clojure, with predicates backed by Cascading, for a highly declarative language
  - run ad-hoc queries from the Clojure REPL; approx. 10:1 code reduction compared with SQL
  - composable subqueries, used for test-driven development (TDD) practices at scale
  - Leiningen build: simple, no surprises, in Clojure itself
  - more new deployments than other Cascading DSLs; Climate Corp is the largest use case: 90% Clojure/Cascalog
  - has a learning curve, and a limited number of Clojure developers
  - aggregators are the magic, and those take effort to learn
  Notes: From what we see about language features, customer case studies, and best practices in general, Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
• word count – Scalding / Scala:

    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }

  Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog. (The Scala symbol quotes and regex escapes lost in the page extraction have been restored here.)
• word count – Scalding / Scala. github.com/twitter/scalding/wiki
  - extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
  - code is compact, easy to understand
  - nearly 1:1 between elements of the conceptual flow diagram and function calls
  - extensive libraries are available for linear algebra, abstract algebra, machine learning: e.g., Matrix API, Algebird, etc.
  - significant investments by Twitter, Etsy, eBay, etc.
  - great for data services at scale
  - less learning curve than Cascalog, though not as much of a high-level language
  Notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project, that’s Scalding. That’s what they’re doing.
• word count – Scalding / Scala (continued). The Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process (imagine SOA infra @ Google as an open source project).
  Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java...
• The Workflow Abstraction, agenda recap; next: 5. Workflows.
  Notes: Tracking back to the Marketing Funnel as an example workflow, let’s consider how Cascading apps incorporate other components beyond Hadoop.
• Enterprise Data Workflows. Back to our marketing funnel, let’s consider an example app... at the front end: LOB use cases drive demand for apps. [architecture diagram, as before]
  Notes: LOB use cases drive the demand for Big Data apps.
• Enterprise Data Workflows. An example... in the back office: organizations have substantial investments in people, infrastructure, and process. [architecture diagram, as before]
  Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
• Enterprise Data Workflows. An example... for the heavy lifting: “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out. [architecture diagram, as before]
  Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
• Cascading workflows – taps.
  - taps integrate other data frameworks, as tuple streams
  - these are the “plumbing” endpoints in the pattern language
  - sources (inputs), sinks (outputs), traps (exceptions)
  - text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
  - data serialization: Avro, Thrift, Kryo, JSON, etc.
  - extend a new kind of tap in just a few lines of Java
  - schema and provenance get derived from analysis of the taps
  Notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
• Cascading workflows – taps. [repeats the Word Count listing above, highlighting the source and sink taps for TSV data in HDFS: the two new Hfs( new TextDelimited( true, "\t" ), path ) lines]
  Notes: Here are the taps in the WordCount source.
• Cascading workflows – topologies.
  - topologies execute workflows on clusters
  - the flow planner is like a compiler for queries: Hadoop (MapReduce jobs); local mode (dev/test or special config); in-memory data grids (real-time)
  - the flow planner can be extended to support other topologies
  - blend flows in different topologies into the same app: for example, batch (Hadoop) + transactions (IMDG)
  Notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
• Cascading workflows – topologies. [repeats the Word Count listing above, highlighting new HadoopFlowConnector( properties ) as the flow planner for the Apache Hadoop topology; a local-mode counterpart is sketched below]
  Notes: Here is the flow planner for Hadoop in the WordCount source.
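To illustrate the "same pipe assembly, different topology" point, here is a minimal sketch assuming Cascading 2.x local mode: the LocalFlowConnector, FileTap, and local TextDelimited classes are from the Cascading 2.x local-mode packages as I recall them, so treat the exact names as assumptions rather than code from the talk. Only the taps and the connector differ from the Hadoop version above:

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;   // local-mode planner (assumed Cascading 2.x)
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextDelimited;      // local scheme, not the Hadoop one
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;               // local file system tap
    import cascading.tuple.Fields;

    public class LocalWordCount {
      public static void main(String[] args) {
        // same pipe assembly as the Hadoop version; only taps and connector change
        Tap docTap = new FileTap(new TextDelimited(true, "\t"), args[0]);
        Tap wcTap = new FileTap(new TextDelimited(true, "\t"), args[1]);

        Fields token = new Fields("token");
        Fields text = new Fields("text");
        Pipe docPipe = new Each("token", text,
            new RegexSplitGenerator(token, "[ \\[\\](),.]"), Fields.RESULTS);

        Pipe wcPipe = new GroupBy(new Pipe("wc", docPipe), token);
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

        FlowDef flowDef = FlowDef.flowDef().setName("wc")
            .addSource(docPipe, docTap)
            .addTailSink(wcPipe, wcTap);

        // the local planner runs the same flow definition in a single JVM
        Flow flow = new LocalFlowConnector().connect(flowDef);
        flow.complete();
      }
    }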
• example topologies...
  Notes: Here are some examples of topologies for distributed computing: Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
• Cascading workflows – ANSI SQL.
  - collaboration with Optiq: an industry-proven code base
  - ANSI SQL parser/optimizer atop the Cascading flow planner
  - JDBC driver to integrate into existing tools and app servers (see the sketch after this list)
  - relational catalog over a collection of unstructured data
  - SQL shell prompt to run queries
  - enable analysts without retraining on Hadoop, etc.
  - transparency for Support, Ops, Finance, et al.
  - a language for queries: not a database, but ANSI SQL as a DSL for workflows
  Notes: ANSI SQL as “machine code”: the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer. BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
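As a sketch of what the JDBC integration looks like from an existing tool or app server, here is a minimal example assuming the Lingual JDBC driver: the driver class name (cascading.lingual.jdbc.Driver), the "jdbc:lingual:local" URL form, and the employees schema are assumptions based on the Lingual documentation of the period, not code from the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LingualQuery {
      public static void main(String[] args) throws Exception {
        // register the Lingual JDBC driver (class name assumed from Lingual docs)
        Class.forName("cascading.lingual.jdbc.Driver");

        // "local" platform here; a Hadoop cluster would use a different platform in the URL
        Connection connection = DriverManager.getConnection("jdbc:lingual:local");
        Statement statement = connection.createStatement();

        // an ordinary ANSI SQL query, planned and run as a Cascading flow
        ResultSet resultSet = statement.executeQuery(
            "SELECT emp_no, last_name FROM employees.employees WHERE last_name LIKE 'A%'");

        while (resultSet.next())
          System.out.println(resultSet.getInt(1) + "\t" + resultSet.getString(2));

        connection.close();
      }
    }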
• ANSI SQL – CSV data in the local file system. cascading.org/lingual
  Notes: The test database for MySQL is available for download from https://launchpad.net/test-db/. Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL describing the expected table schema.
• ANSI SQL – shell prompt, catalog. cascading.org/lingual
  Notes: Use the “lingual” SQL shell prompt to run SQL queries interactively, show the catalog, etc.
• ANSI SQL – queries. cascading.org/lingual
  Notes: Here’s an example SQL query on that “employee” test database from MySQL.
• Cascading workflows – machine learning.
  - migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
  - Cascading creates parallelized models to run at scale on Hadoop clusters
  - Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.
  - integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)
  - 2 lines of code, or a pre-built JAR
  - run multiple variants of models as customer experiments
  Notes: PMML has been around for a while, and export is supported by nearly every commercial analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Several companies are collaborating on this open source project: https://github.com/Cascading/cascading.pattern
• model creation in R. cascading.org/pattern

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)
    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

  Notes: Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
• model run at scale as a Cascading app. [flow diagram: Customer Orders → Classify (PMML model) → Assert → GroupBy(token) → Count → Scored Orders, with Failure Traps and a Confusion Matrix] cascading.org/pattern
  Notes: Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
• model run at scale as a Cascading app:

    public class Main {
      public static void main( String[] args ) {
        String pmmlPath = args[ 0 ];
        String ordersPath = args[ 1 ];
        String classifyPath = args[ 2 ];
        String trapPath = args[ 3 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

        // define a "Classifier" model from PMML to evaluate the orders
        ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
        Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( classifyPipe, ordersTap )
          .addTrap( classifyPipe, trapTap )
          .addSink( classifyPipe, classifyTap );

        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
    }

  Notes: Source code for a simple Cascading app that runs PMML models in general.
• PMML support...
  Notes: Popular tools which can create predictive models for export as PMML.
• Cascading workflows – test-driven development.
  - assert patterns (regex) on the tuple streams
  - adjust assert levels, like log4j levels
  - trap edge cases as “data exceptions”
  - TDD at scale: 1. start from raw inputs in the flow graph; 2. define stream assertions for each stage of transforms; 3. verify exceptions, code to remove them; 4. when the implementation is complete, the app has full test coverage
  - TDD follows from Cascalog’s composable subqueries
  - redirect traps in production to Ops, QA, Support, Audit, etc.
  (A sketch of a stream assertion follows this slide.)
  Notes: TDD is not usually high on the list when people start discussing Big Data apps. The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates to composing those predicates into large-scale apps.
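As a sketch of what a stream assertion looks like in the Java API, here is a fragment assuming Cascading 2.x: the AssertionLevel and AssertMatches classes are from the Cascading 2.x API as I recall it, and the regex pattern is purely hypothetical, so treat the details as assumptions rather than code from the talk. It wraps the docPipe from the Word Count example above, so malformed tuples become data exceptions routed to a trap:

    import cascading.operation.AssertionLevel;
    import cascading.operation.assertion.AssertMatches;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;

    public class AssertTokens {
      /** Wraps a pipe with a STRICT stream assertion; failing tuples
          become "data exceptions" and flow to the trap tap instead of
          killing the job. STRICT runs in every mode, and can be planned
          out of production flows by lowering the level, like log4j. */
      public static Pipe checked( Pipe docPipe ) {
        return new Each( docPipe, AssertionLevel.STRICT,
            new AssertMatches( "^[\\w\\s]+$" ) );  // hypothetical pattern: word chars only
      }
    }

    // usage, given the flowDef from the Word Count example:
    //   Pipe checkedPipe = AssertTokens.checked( docPipe );
    //   flowDef.addTrap( checkedPipe, trapTap );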
• Cascading workflows – TDD meets API principles.
  - specify what is required, not how it must be achieved
  - plan far ahead, before consuming cluster resources: fail fast prior to submit
  - fail the same way twice: deterministic flow planners help reduce engineering costs for debugging at scale
  - same JAR, any scale: the app does not require a recompile to change data taps or cluster topologies
  Notes: Some of the design principles for the pattern language.
• Two Avenues... Enterprise: must contend with complexity at scale everyday; incumbents extend current practices and infrastructure investments, using J2EE, ANSI SQL, SAS, etc., to migrate workflows onto Apache Hadoop while leveraging existing staff. Start-ups: crave complexity and scale to become viable; new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding. [chart axes: complexity ➞, scale ➞]
  Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity, and incumbent firms grappling with complexity.
• Two Avenues (continued)... Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps.
  Notes: Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.
• The Workflow Abstraction, agenda recap; next: 6. Abstraction.
  Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
• Cascading workflows – pattern language. Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language. [flow diagram: Word Count with scrub and stop-word HashJoin, as before]
  Notes: A pattern language, based on the metaphor of “plumbing”.
• references... pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. amazon.com/dp/0195019199. design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the “Gang of Four”. amazon.com/dp/0201633612
  Notes: Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
• Cascading workflows – pattern language (continued). The design principles of the pattern language ensure best practices for robust, parallel data workflows at scale.
  Notes: The pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices, which also addresses staffing issues.
• Cascading workflows – literate programming. Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. This provides intuitive, visual representations for apps, great for cross-team collaboration.
  Notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling: expert developers generally ask a novice to provide a flow diagram first.
• references... Literate Programming, by Don Knuth. Univ of Chicago Press, 1992. literateprogramming.com/ “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  Notes: Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
• examples... Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams: excellent elision and literate representation. Noticed on the cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code. In formal terms, a flow diagram is a directed acyclic graph (DAG), on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc. [DOT flow plan for Word Count, as shown earlier]
  Notes: Literate programming examples observed on the email list are some of the best illustrations of this methodology.
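The DAG observation is not just formalism. For instance, a topological sort of the flow diagram yields a valid execution order for the job steps, which is the kind of property flow planners and cluster schedulers exploit. Here is a minimal, self-contained sketch of that idea (Kahn's algorithm); the step names are hypothetical, loosely mirroring the Word Count plan above:

    import java.util.*;

    public class FlowOrder {
      // edges: step -> downstream steps (a tiny DAG mirroring the Word Count plan)
      static Map<String, List<String>> edges = Map.of(
          "source", List.of("tokenize"),
          "tokenize", List.of("groupby"),
          "groupby", List.of("count"),
          "count", List.of("sink"),
          "sink", List.of());

      public static void main(String[] args) {
        // Kahn's algorithm: repeatedly emit a node with no unprocessed predecessors
        Map<String, Integer> indegree = new HashMap<>();
        edges.keySet().forEach(n -> indegree.putIfAbsent(n, 0));
        edges.values().forEach(ds -> ds.forEach(d -> indegree.merge(d, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((n, d) -> { if (d == 0) ready.add(n); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
          String n = ready.remove();
          order.add(n);
          for (String d : edges.get(n))
            if (indegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        System.out.println(order);  // [source, tokenize, groupby, count, sink]
      }
    }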
• Cascading workflows – business process. Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data): a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
  Notes: Business stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
• references... “A relational model of data for large shared data banks”, by Edgar Codd. Communications of the ACM, 1970. dl.acm.org/citation.cfm?id=362685. Rather than arguing between SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on: the process of structuring data. That’s what apps do: Making Data Work.
  Notes: Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
• Cascading workflows – functional relational programming. The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970), prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage”, and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, “Out of the Tar Pit”, goo.gl/SKspn
  Notes: A more contemporary statement along similar lines...
• Cascading workflows – functional relational programming (continued). Several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows.
• The Workflow Abstraction, agenda recap; next: 7. Trendlines.
  Notes: Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work heading?
• Q3 1997: inflection point. Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season: AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.
  Notes: Q3 1997: Greg Linden, et al., @ Amazon, and Randy Shoup, et al., @ eBay: independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
• Circa 1996: pre-inflection point. [diagram: Stakeholder → strategy via Excel pivot tables and PowerPoint slide decks; BI / Product Analysts → requirements → SQL Query; Engineering → optimized code; Web App → result sets and transactions against an RDBMS; Customers at the edge]
  Notes: Ah, the olde days: Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time. Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”... this thinking led to impossible silos.
• Circa 2001: post big ecommerce successes. [diagram: Stakeholder, Product, Customers → dashboards, UX; Engineering → models, servlets; Algorithmic Modeling → recommenders + classifiers; Web Apps → Middleware → aggregated event history; SQL Query → result sets; customer transactions → Logs → ETL → DW and RDBMS]
  Notes: Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models, e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
  
• Circa 2013: clusters everywhere. [architecture diagram: Data Products and Customers at top; an interdisciplinary team (Domain Expert, Data Scientist, App Dev, Ops) around business process, Workflow, and dashboard metrics; Web Apps, Mobile, etc.; social interactions + transactions and content feeding discovery and modeling; Use Cases Across Topologies: Hadoop, Log Events, In-Memory Data Grid, batch and near-time; app History feeding a Planner and Cluster Scheduler for optimized capacity; existing SDLC, DW, and RDBMS alongside the newly introduced capability]
  Notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for an advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Mesos, etc.
• Asymptotically...
  - long-term trends toward more instrumentation of Enterprise data workflows: the workflow abstraction enables business cases; more machine data gets collected about apps; the flow diagram (DAG) serves as the unit of work (an abstract type for machine data); evolving feedback loops convert machine data into actionable insights and optimizations
  - industry moves beyond the common needs of ad-hoc queries on logs and basic reporting, as a new class of complex data workflows emerges to provide the insights required by Enterprise
  - the end game is less about the “bigness” of data, and more about managing complexity in the process of structuring data
  [diagram: DSL → Planner/Optimizer → Workflow → App → Cluster, with History feeding a Cluster Scheduler]
  Notes: In summary...
• references... “Statistical Modeling: The Two Cultures”, by Leo Breiman. Statistical Science, 2001. bit.ly/eUTh9L
  Notes: Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
• references...
  Amazon: “Early Amazon: Splitting the website” – Greg Linden. glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
  eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett. addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html; addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
  Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff). youtube.com/watch?v=E91oEn1bnXM
  Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff). youtube.com/watch?v=qsan-GQaeyk; perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  “The Birth of Google” – John Battelle. wired.com/wired/archive/13.08/battelle.html
  Notes: In their own words...
• references... Enterprise Data Workflows with Cascading, by Paco Nathan. O’Reilly, 2013. amazon.com/dp/1449358721
  Notes: Some of this material comes from an upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”. This should be in Rough Cuts soon, scheduled to be out in print this June. Many thanks to my wonderful editor, Courtney Nash.
• drill-down... blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: cascading.org; zest.to/group11; github.com/Cascading; conjars.org; goo.gl/KQtUL; concurrentinc.com. Join us for very interesting work! Copyright @2013, Concurrent, Inc.
  Notes: Links to our open source projects, developer community, etc. Contact me @pacoid. http://concurrentinc.com/ (we’re hiring too!)