A QUICK INTRODUCTION TO
THE CASCADING ECOSYSTEM
Chris K Wensel | Hadoop Summit EU 2014
  1. A QUICK INTRODUCTION TO THE CASCADING ECOSYSTEM
     Chris K Wensel | Hadoop Summit EU 2014
  2. WHO AM I?
     • Lead developer of the Cascading open-source project
     • Founder of Concurrent, Inc.
     • Involved with Apache Hadoop since it was called Apache Nutch
     • Systems Architect, not a Data Scientist
  3. For creating data-oriented applications, frameworks, and languages [on Apache Hadoop]
     Originally designed to hide the complexity of Hadoop and prevent thinking in MapReduce
     cascading.org
  4. SOME STATS
     • Started in 2007
     • 2.0 released June 2012
     • 2.5 out now
     • 3.0 WIP (if you look for it)
     • Apache 2.0 Licensed
     • Supports all Hadoop distros
  5. What’s it used for?
  6. • Cascading Java API
     • Data normalization and cleansing of search and click-through logs for use by analytics tools
     • Easy to operationalize heavy lifting of data
  7. • Cascalog (Clojure)
     • Weather pattern modeling to protect growers against loss
     • ETL against 20+ datasets daily
     • Machine learning to create models
     • Purchased by Monsanto for $930M US
  8. • Scalding (Scala)
     • Machine learning (linear algebra) to improve
       • User experience
       • Ad quality (matching users and ad effectiveness)
     • All revenue applications are running on Cascading/Scalding
     • IPO TWITTER
  9. • Estimate suicide risk from what people write online
     • Cascading + Cassandra
     • You can do more than optimize ad yields
     • http://www.durkheimproject.org
  10. KEY PROJECTS
      [Diagram labels: Lingual, Pattern, Cascading, Apache Hadoop, Scalding, Cascalog]
  11. CASCADING
      • Java API (alternative to Hadoop MapReduce)
      • Separates business logic from integration
      • Testable at every lifecycle stage
      • Works with any JVM language
      • Many integration adapters
      [Diagram labels: Process Planner, Processing API, Integration API, Scheduler API, Scheduler, Apache Hadoop, Cascading, Data Stores, Scripting (Scala, Clojure, JRuby, Jython, Groovy), Enterprise Java]
  12. SOME COMMON PATTERNS
      • Functions
      • Filters
      • Joins
        ‣ Inner / Outer / Mixed
        ‣ Asymmetrical / Symmetrical
      • Merge (Union)
      • Grouping
        ‣ Secondary Sorting
        ‣ Unique (Distinct)
      • Aggregations
        ‣ Count, Average, etc.
        ‣ Rolling windows
      [Diagram labels: Pipeline (data, filter, function); Topology (Split, Join, Merge)]
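To make the vocabulary on this slide concrete, here is a minimal sketch of a filter, a function, a grouping, and a count aggregation chained into one pipeline. It uses plain `java.util.stream`, not the Cascading API, and the tuple layout (`{ user, url }`) and the class name `PatternsSketch` are assumptions made only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class PatternsSketch {
    // Hypothetical tuple layout: t[0] = user, t[1] = url.
    static Map<String, Long> clicksPerUser(List<String[]> tuples) {
        return tuples.stream()
            // Filter: drop tuples that carry no URL
            .filter(t -> !t[1].isEmpty())
            // Function: emit a new tuple with the user field normalized to lower case
            .map(t -> new String[] { t[0].toLowerCase(), t[1] })
            // Grouping by user, followed by a Count aggregation per group
            .collect(Collectors.groupingBy(t -> t[0], TreeMap::new, Collectors.counting()));
    }
}
```

In Cascading terms the same shape would be expressed as pipes in an assembly; the sketch only shows how the four concepts relate to each other.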
  13. word count – Cascading Java API

      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow definition
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // create the Flow
      Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
      wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
      wcFlow.complete(); // <<-- Runs jobs on Cluster

      [Slide callouts: configuration, integration, processing, scheduling]
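For readers who want to see what the word count flow computes without a cluster, the following is an in-memory sketch in plain Java. It is not Cascading code: the class name `WordCountSketch` is hypothetical, the delimiter class is the one passed to `RegexSplitGenerator` on the slide, and dropping empty tokens is a choice made for this sketch rather than documented behavior of the library.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {
    // Delimiters from the slide: space, square brackets, parentheses, comma, period.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
            // Corresponds to Each + RegexSplitGenerator: one line in, many tokens out
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // Sketch-only assumption: skip empty tokens produced by adjacent delimiters
            .filter(token -> !token.isEmpty())
            // Corresponds to GroupBy( token ) followed by Every( Count )
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("rain shadow, rain.", "dry (lee) side"));
    }
}
```

The point of the comparison is that the Cascading version describes the same three stages, but as a plan that the flow connector can turn into cluster jobs.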
  14. wc.dot (flow plan, a single mapreduce step)
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
      Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
      GroupBy('wc')[by:['token']]
      Every('wc')[Count[decl:'count']]
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
      [tail]
      Edge labels show the fields passed between elements: [{2}:'doc_id', 'text'], [{1}:'token'], [{2}:'token', 'count']
  15. A REAL WORLD APP
      [Diagram: flow plan with steps [1/75] through [75/75]; steps 55, 57–62, 71, and 72 are labeled map, all others map+reduce]
      1 App, 1 Flow, 75 Steps/MR Jobs
      green = map + reduce, purple = map, blue = join/merge, orange = map split
      A graph of jobs, not operations!
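Because a flow is a graph of jobs, each step can only start once the steps it reads from have finished. A minimal sketch of that idea is a topological ordering (Kahn's algorithm) over a step graph. This is an illustration only, not Cascading's scheduler; the class name `StepGraphSketch` and the edge encoding are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StepGraphSketch {
    /** edges[i] = {from, to}: step `to` reads what step `from` wrote. Steps are numbered 0..stepCount-1. */
    static List<Integer> runOrder(int stepCount, int[][] edges) {
        Map<Integer, List<Integer>> downstream = new HashMap<>();
        int[] pending = new int[stepCount]; // number of unfinished upstream steps
        for (int[] e : edges) {
            downstream.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            pending[e[1]]++;
        }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int s = 0; s < stepCount; s++) {
            if (pending[s] == 0) ready.add(s); // steps with no upstream can start immediately
        }
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int step = ready.poll();
            order.add(step);
            for (int next : downstream.getOrDefault(step, new ArrayList<>())) {
                if (--pending[next] == 0) ready.add(next); // last dependency just finished
            }
        }
        if (order.size() != stepCount) {
            throw new IllegalArgumentException("cycle in step graph");
        }
        return order;
    }
}
```

Steps that sit in the ready queue at the same time have no dependency on each other, so a scheduler would be free to run them in parallel.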
  16. It’s not just for Java
  17. word count – Scalding (Scala)

      // Sujit Pal
      // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

      package com.mycompany.impatient

      import com.twitter.scalding._

      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) {
            text : String => text.split("""\s+""")
          }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  18. word count – Cascalog (Clojure)

      ; Paul Lam
      ; github.com/Quantisan/Impatient

      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  19. “FOR THE IMPATIENT” SERIES
      • Step-by-step tutorials on Cascading on GitHub
      • Community has ported them to Scalding and Cascalog
      • http://docs.cascading.org/impatient/
  20. WHY CASCADING?
      • Foundation of patterns and best practices for building Languages, Frameworks, and Applications
      • Designed to abstract Hadoop away from the business logic
      • Models other than MapReduce are on the way!
  21. LINGUAL
      • ANSI Compatible SQL
      • JDBC Driver
      • Cascading Java API
      • SQL Command Shell
      • Catalog Manager Tool
      • Data Provider API
      [Diagram labels: Query Planner, JDBC API, Lingual API, Provider API, Cascading, Apache Hadoop, Lingual, Data Stores, CLI / Shell, Enterprise Java, Catalog]
  22. Cascading API

      FlowDef flowDef = FlowDef.flowDef()
        .setName( "sqlflow" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

      SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );

      flowDef.addAssemblyPlanner( sqlPlanner );
  23. JDBC driver

      public void run() throws ClassNotFoundException, SQLException {
        Class.forName( "cascading.lingual.jdbc.Driver" );
        Connection connection =
          DriverManager.getConnection(
            "jdbc:lingual:local;schemas=src/main/resources/data/example" );
        Statement statement = connection.createStatement();

        ResultSet resultSet = statement.executeQuery(
          "select *\n"
            + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
            + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
            + "on e.\"EMPID\" = s.\"CUST_ID\"" );

        // do something

        resultSet.close();
        statement.close();
        connection.close();
      }
  24. SHELL - !TABLES
  25. # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver",
        "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv,
        "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection,
        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  26. > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92
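The `hire_age` column summarized here is the number of days between birth date and hire date divided by 365.25. As a small sketch of that arithmetic in Java (the class and method names are hypothetical, and the dates would come from the `BIRTH_DATE` and `HIRE_DATE` columns returned over JDBC):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class HireAgeSketch {
    /** Age at hire in years, using the same days / 365.25 approximation as the R script. */
    static double hireAge(LocalDate birthDate, LocalDate hireDate) {
        long days = ChronoUnit.DAYS.between(birthDate, hireDate);
        return days / 365.25;
    }
}
```

Dividing by 365.25 averages leap years in, which is why the results are fractional rather than whole years.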
  27. “But we use a custom data format”
  28. DATA PROVIDER API
      • Any Cascading Tap and/or Scheme can be used from JDBC
      • Use a “fat jar” on local disk or from a Maven repo
        ‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
      • The jar is dynamically loaded into the cluster
  29. SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...
      [Diagram labels: Amazon Elastic MapReduce (Job, Job, Job, Job); Amazon S3; Amazon RedShift; file1, file2, results]
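The query on this slide is an equi-join on `id`. As a rough picture of what such a join computes, here is an in-memory hash join in plain Java. It is a sketch, not how Lingual or Cascading plans the join; the class name `HashJoinSketch` and the convention that column 0 holds `id` are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {
    /** Inner join of two row sets; each row is a String[] whose first column is the join key `id`. */
    static List<String[]> innerJoin(List<String[]> file1, List<String[]> file2) {
        // Build side: index file2 rows by id
        Map<String, List<String[]>> byId = new HashMap<>();
        for (String[] row : file2) {
            byId.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe side: for every file1 row, emit one joined row per matching file2 row
        List<String[]> results = new ArrayList<>();
        for (String[] left : file1) {
            for (String[] right : byId.getOrDefault(left[0], new ArrayList<>())) {
                String[] joined = new String[left.length + right.length];
                System.arraycopy(left, 0, joined, 0, left.length);
                System.arraycopy(right, 0, joined, left.length, right.length);
                results.add(joined);
            }
        }
        return results;
    }
}
```

Rows of `file1` without a match are dropped, which is what makes this an inner join; an outer join would emit them padded with nulls instead.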
  30. WHY LINGUAL
      • Quickly migrate existing workloads from RDBMS to Hadoop
      • Quickly extract data from Hadoop into applications
  31. PATTERN
      • Predictive model scoring
      • Java API and PMML parser
      • Supports:
        ‣ (General) Regression
        ‣ Clustering
        ‣ Decision Trees
        ‣ Random Forest
        ‣ and ensembles of models
      [Diagram labels: PMML Parser, Pattern API, Cascading, Apache Hadoop, Pattern, Data Stores, Enterprise Java]
  32. FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

      flowDef.addAssemblyPlanner( pmmlPlanner );
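"Model scoring" means applying an already trained model to each incoming tuple. For a linear regression that is just an intercept plus a weighted sum of fields, as in the sketch below. This is plain Java for illustration, not Pattern's implementation; the class name `RegressionScoreSketch` and the field names used with it are hypothetical, and in practice the coefficients would come from the PMML file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RegressionScoreSketch {
    private final double intercept;
    private final Map<String, Double> coefficients; // field name -> weight

    RegressionScoreSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = new LinkedHashMap<>(coefficients); // defensive copy
    }

    /** Scores one tuple; fields the model does not use are ignored. */
    double score(Map<String, Double> tuple) {
        double y = intercept;
        for (Map.Entry<String, Double> c : coefficients.entrySet()) {
            Double x = tuple.get(c.getKey());
            if (x == null) {
                throw new IllegalArgumentException("missing field: " + c.getKey());
            }
            y += c.getValue() * x;
        }
        return y;
    }
}
```

Because the model is only data (an intercept and a coefficient table), it can be trained in one tool and scored in another, which is the integration point the slides attribute to PMML.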
  33. WHY PATTERN
      • Standards compliance provides integration with many tools
      • Models are independent of data and integration
      • Only debugging Cascading, not an ensemble of applications
  34. CLOSING THE LOOP
      [Diagram labels: Desktop, Cluster, Pattern, Job, PMML, Flow, JDBC, DATA, test results]
      Steps: import data, create models, export models, execute models, import results
  35. DRIVEN
      • Understand how your application maps onto your cluster
      • Identify bottlenecks (data, code, or the system)
      • Jump to the line of code implicated in a failure
      • Plugin available via Maven repo
      • Beta UI hosted online
      http://cascading.io/driven/
  36. MANAGED WITH DRIVEN
  37. [No text on this slide]
  38. CASCADING 3.0
      • New query planner
        ‣ User definable Assertion and Transformation rules
        ‣ Sub-Graph Isomorphism Pattern Matching
        ‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75
      • Hadoop Tez support
      • And likely other platforms
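Sub-graph pattern matching asks whether a small pattern graph can be found inside a larger graph. The sketch below is a naive backtracking matcher over directed adjacency matrices. It is not the algorithm from the cited paper and not the Cascading 3.0 planner; it only requires that every pattern edge exists in the target (it does not forbid extra target edges), and the class name `SubgraphMatchSketch` is hypothetical.

```java
final class SubgraphMatchSketch {
    /** Returns mapping[p] = target node chosen for pattern node p, or null if no match exists. */
    static int[] match(boolean[][] pattern, boolean[][] target) {
        int[] mapping = new int[pattern.length];
        boolean[] used = new boolean[target.length];
        return extend(0, mapping, used, pattern, target) ? mapping : null;
    }

    private static boolean extend(int p, int[] mapping, boolean[] used,
                                  boolean[][] pattern, boolean[][] target) {
        if (p == pattern.length) return true; // every pattern node has been placed
        for (int t = 0; t < target.length; t++) {
            if (used[t]) continue; // each target node may be used at most once
            mapping[p] = t;
            if (consistent(p, mapping, pattern, target)) {
                used[t] = true;
                if (extend(p + 1, mapping, used, pattern, target)) return true;
                used[t] = false; // backtrack and try the next candidate
            }
        }
        return false;
    }

    // Every pattern edge between node p and the nodes placed so far (including p itself)
    // must also be present between their images in the target.
    private static boolean consistent(int p, int[] mapping,
                                      boolean[][] pattern, boolean[][] target) {
        for (int q = 0; q <= p; q++) {
            if (pattern[p][q] && !target[mapping[p]][mapping[q]]) return false;
            if (pattern[q][p] && !target[mapping[q]][mapping[p]]) return false;
        }
        return true;
    }
}
```

One way to read the slide is that a rule supplies the pattern graph and the flow's element graph is the target, with a successful match marking where an assertion or transformation applies; that reading is an interpretation, not something the slide spells out.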
  39. THERE’S A BOOK!
      Enterprise Data Workflows with Cascading
      - Paco Nathan
      O’Reilly, 2013
      amazon.com/dp/1449358721
  40. CONTACT
      @cwensel | @cascading
      chris@wensel.net
      www.cascading.org
      www.concurrentinc.com
