Cascading for the Impatient
 


Lightning talk version of "Cascading for the Impatient" http://cascading.org/category/impatient/


Usage Rights

CC Attribution-ShareAlike License

Cascading for the Impatient: Presentation Transcript

  • Cascading for the Impatient
    Paco Nathan, Concurrent, Inc.
    pnathan@concurrentinc.com, @pacoid
    Copyright © 2012, Concurrent, Inc.
    [Flow diagram with stages labelled: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
  • why? Unstructured Data meets Enterprise Scale
  • how? Cascading.org/
    [Flow diagram with stages labelled: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
  • who?
    • Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
    • Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
    • Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
    • Data Architect POV: a physical plan for large-scale data flow management
    • Software Architect POV: a pattern language, similar to plumbing or circuit design
    • App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
    • Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
    [Flow diagram repeated from the previous slide]
  • where?
    • business process: Domain expertise, business trade-offs, operating parameters, etc.
    • API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
    • logical plan / optimize: (raw human intellect, unless…)
    • physical plan: [flow diagram repeated from the previous slides]
    • compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.; “assembler” code
    • notification: monitors, Nagios, etc.
  • 1: copy
    [Diagram: Source, M, Sink]
    1 mapper, 0 reducers, 10 lines code

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class Main
      {
      public static void main( String[] args )
        {
        // input and output paths come from the command line
        String inPath = args[ 0 ];
        String outPath = args[ 1 ];

        Properties props = new Properties();
        AppProps.setApplicationJarClass( props, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

        // create the source tap (tab-delimited text with a header row)
        Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

        // create the sink tap
        Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

        // specify a pipe to connect the taps
        Pipe copyPipe = new Pipe( "copy" );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
          .addSource( copyPipe, inTap )
          .addTailSink( copyPipe, outTap );

        // run the flow
        flowConnector.connect( flowDef ).complete();
        }
      }
  • wait! ten lines of code for a file copy … seems like a lot.
  • same JAR, any scale…
    • MegaCorp Enterprise IT: PBs of data; 1000+ node cluster; EVP calls you when app fails; runtime: days+
    • Production Cluster: TBs of data; EMR + 50 HPC Instances; Ops monitors results; runtime: hours – days
    • Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
    • Your Mom’s Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 2: word count
    [Flow diagram with stages labelled: Document Collection, Tokenize, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
    1 mapper, 1 reducer, 18 lines code
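
The stages named on this slide (Tokenize, GroupBy token, Count) can be illustrated without a cluster. The following is a minimal plain-Java sketch of what the flow computes, not the Cascading API and not the code behind the slide; the class name `WordCountSketch` and the delimiter set in `tokenize` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the word-count data flow; this is NOT the Cascading API.
final class WordCountSketch {

    // "Tokenize": split a line on spaces and basic punctuation (an assumed delimiter set).
    static List<String> tokenize(String line) {
        return Arrays.stream(line.split("[ \\[\\](),.]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // "GroupBy token" + "Count": tally how often each token occurs across all lines.
    static Map<String, Long> wordCount(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "A rain shadow is a dry area on the lee back side of a mountainous area.");
        System.out.println(wordCount(lines));
    }
}
```

Note that "A" and "a" are counted separately here, which is the gap the scrub step in the next part addresses.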
  • 3: wc + scrub
    [Flow diagram with stages labelled: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
    1 mapper, 1 reducer, 22+10 lines code
  • 4: wc + scrub + stop words
    [Flow diagram with stages labelled: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
    1 mapper, 1 reducer, 28+10 lines code
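
The diagram on this slide joins tokens against a stop word list on the right-hand side and then filters. A plain-Java sketch of that idea follows; it is not Cascading code. It assumes that "scrub" means trimming and lower-casing, and that the filter keeps only tokens with no match in the stop list. The class name `StopWordSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of scrub + stop-word removal; this is NOT the Cascading API.
final class StopWordSketch {

    // "Scrub": normalize a token. Trimming and lower-casing are assumptions.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Left join against the stop list, then keep only the unmatched tokens.
    static List<String> removeStopWords(List<String> tokens, List<String> stopList) {
        // The small right-hand side is held in memory, as in a hash join.
        Set<String> stop = new HashSet<>(stopList);
        List<String> kept = new ArrayList<>();
        for (String raw : tokens) {
            String token = scrub(raw);
            boolean matched = stop.contains(token);
            if (!token.isEmpty() && !matched) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A", " Rain ", "shadow", "is", "a", "dry", "area");
        List<String> stopList = Arrays.asList("a", "is");
        System.out.println(removeStopWords(tokens, stopList));
    }
}
```

Keeping the stop list in memory is what makes a hash join attractive here: the large token stream is scanned once and never regrouped.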
  • 5: tf-idf
    [Flow diagram with stages labelled: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token; Unique doc_id, Insert 1, SumBy doc_id, D; Unique token, GroupBy token, Count, DF; GroupBy doc_id, token, Count, TF; HashJoin (RHS), CoGroup, ExprFunc tf-idf, TF-IDF; GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
    11 mappers, 9 reducers, 65+10 lines code
  • 6: tf-idf + tdd
    [Flow diagram as in part 5, with added elements labelled: Assert, Checkpoint, Failure Traps]
    12 mappers, 9 reducers, 76+14 lines code
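
The "Assert" and "Failure Traps" labels on this slide suggest that rows failing a check are diverted rather than aborting the whole run. The sketch below illustrates that routing in plain Java; it is not the Cascading API. The `doc\d+` pattern for a valid `doc_id` and the class name `TrapSketch` are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of assert-and-trap routing; this is NOT the Cascading API.
final class TrapSketch {

    // Assumed assertion: a valid doc_id is "doc" followed by digits.
    private static final Pattern VALID_DOC_ID = Pattern.compile("doc\\d+");

    final List<String[]> passed = new ArrayList<>();
    final List<String[]> trapped = new ArrayList<>();

    // Route one (doc_id, text) row: failures go to the trap, the rest continue downstream.
    void accept(String docId, String text) {
        String[] row = {docId, text};
        boolean ok = docId != null
                && VALID_DOC_ID.matcher(docId).matches()
                && text != null;
        if (ok) {
            passed.add(row);
        } else {
            trapped.add(row);
        }
    }

    public static void main(String[] args) {
        TrapSketch sketch = new TrapSketch();
        sketch.accept("doc01", "A rain shadow is a dry area on the lee back side of a mountainous area.");
        sketch.accept("zoink", null);
        System.out.println("passed=" + sketch.passed.size() + " trapped=" + sketch.trapped.size());
    }
}
```

The design point is that one malformed record costs a row in a side output, not a failed job.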
  • deployed…
    elastic-mapreduce --create --name "TF-IDF" \
      --jar s3n://temp.cascading.org/impatient/part6.jar \
      --arg s3n://temp.cascading.org/impatient/rain.txt \
      --arg s3n://temp.cascading.org/impatient/out/wc \
      --arg s3n://temp.cascading.org/impatient/en.stop \
      --arg s3n://temp.cascading.org/impatient/out/tfidf \
      --arg s3n://temp.cascading.org/impatient/out/trap \
      --arg s3n://temp.cascading.org/impatient/out/check
  • results?
    Input:

    doc_id  text
    doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
    doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
    doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
    doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as Californias Death Valley.
    doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
    zoink   null

    Output:

    doc_id  tf-idf  token
    doc02   0.9163  air
    doc05   0.9163  australia
    doc05   0.9163  broken
    doc04   0.9163  californias
    doc04   0.9163  cause
    doc02   0.9163  cloudcover
    doc04   0.9163  death
    doc04   0.9163  deserts
    doc03   0.9163  downwind
    …
    doc02   0.9163  sinking
    doc04   0.9163  such
    doc04   0.9163  valley
    doc05   0.9163  women
    doc03   0.5108  land
    doc05   0.5108  land
    doc01   0.5108  lee
    doc02   0.5108  lee
    doc03   0.5108  leeward
    doc04   0.5108  leeward
    doc01   0.4463  area
    doc02   0.2231  area
    doc03   0.2231  area
    doc01   0.2231  dry
    doc02   0.2231  dry
    doc03   0.2231  dry
    doc02   0.2231  mountain
    doc03   0.2231  mountain
    doc04   0.2231  mountain
    doc01   0.0000  rain
    doc02   0.0000  rain
    doc03   0.0000  rain
    doc04   0.0000  rain
    doc01   0.0000  shadow
    doc02   0.0000  shadow
    doc03   0.0000  shadow
    doc04   0.0000  shadow
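
The scores on this slide are consistent with tf-idf = tf × ln(N / (1 + df)), where tf is the token's count within a document, df is the number of documents containing the token, and N = 5 documents. That formula is inferred here from the numbers, not quoted from the slide. The following plain-Java sketch checks it against three rows; the df values were counted by hand from the sample text, and `TfIdfSketch` is a hypothetical name.

```java
import java.util.Locale;

// Plain-Java check of the tf-idf scores shown on the results slide.
final class TfIdfSketch {

    // Formula inferred from the slide's numbers: tf * ln(nDocs / (1 + df)).
    static double tfIdf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    static String format(double score) {
        return String.format(Locale.ROOT, "%.4f", score);
    }

    public static void main(String[] args) {
        long nDocs = 5; // doc01..doc05

        // "air" occurs once in doc02 and in no other document: tf=1, df=1
        System.out.println("doc02 air  " + format(tfIdf(1, 1, nDocs))); // prints 0.9163

        // "area" occurs twice in doc01 and appears in three documents: tf=2, df=3
        System.out.println("doc01 area " + format(tfIdf(2, 3, nDocs))); // prints 0.4463

        // "rain" appears in four documents: tf=1, df=4, so the ratio is 1 and the score is 0
        System.out.println("doc01 rain " + format(tfIdf(1, 4, nDocs))); // prints 0.0000
    }
}
```

The "+ 1" in the denominator is why tokens that appear in four of the five documents, such as "rain" and "shadow", score exactly zero.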
  • comparisons? compare similar code in Scalding and Cascalog:
    • sujitpal.blogspot.com/2012/08/scalding-for-impatient.html based on: github.com/twitter/scalding/wiki
    • github.com/Quantisan/Impatient based on: github.com/nathanmarz/cascalog/wiki
  • drill-down?
    • blog: cascading.org/category/impatient/
    • code, wiki, gists: github.org/Cascading/
    • jars: conjars.org/
    • list: goo.gl/KQtUL
    • DevOps products: concurrentinc.com/