Hadoop User Group EU 2014


Transcript

  • 1. A QUICK INTRODUCTION TO THE CASCADING ECOSYSTEM Chris K Wensel | Hadoop Summit EU 2014
  • 2. WHO AM I? • Lead developer of the Cascading open-source project • Founder of Concurrent, Inc. • Involved with Apache Hadoop since it was called Apache Nutch • Systems Architect, not a Data Scientist
  • 3. For creating data-oriented applications, frameworks, and languages [on Apache Hadoop]. Originally designed to hide the complexity of Hadoop and prevent thinking in MapReduce. cascading.org
  • 4. SOME STATS • Started in 2007 • 2.0 released June 2012 • 2.5 out now • 3.0 WIP (if you look for it) • Apache 2.0 Licensed • Supports all Hadoop distros
  • 5. What’s it used for?
  • 6. • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools • Easy to operationalize heavy lifting of data
  • 7. • Cascalog (Clojure) • Weather pattern modeling to protect growers against loss • ETL against 20+ datasets daily • Machine learning to create models • Purchased by Monsanto for $930M US
  • 8. TWITTER • Scalding (Scala) • Machine learning (linear algebra) to improve ‣ User experience ‣ Ad quality (matching users and ad effectiveness) • All revenue applications are running on Cascading/Scalding • IPO
  • 9. • Estimate suicide risk from what people write online • Cascading + Cassandra • You can do more than optimize ad yields • http://www.durkheimproject.org
  • 10. KEY PROJECTS [Diagram: Lingual, Pattern, Scalding, Cascalog, Cascading, Apache Hadoop]
  • 11. CASCADING • Java API (alternative to Hadoop MapReduce) • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters [Diagram: Process Planner, Processing API, Integration API, Scheduler API, Scheduler, Apache Hadoop, Cascading, Data Stores, Scripting (Scala, Clojure, JRuby, Jython, Groovy), Enterprise Java]
  • 12. SOME COMMON PATTERNS • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc ‣ Rolling windows [Diagrams: "Pipeline" (data flowing through a chain of filters and functions) and "Topology" (data flowing through Split, Join, and Merge)]
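A few of the patterns on this slide can be tried without a cluster. The following is a minimal sketch in plain Java (java.util.stream, not the Cascading API) of a filter followed by a function, a merge followed by unique, and an inner join. The class name `PatternSketch`, its method names, and the sample data are hypothetical and exist only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PatternSketch {
    // Pipeline: a filter followed by a function applied to each record.
    static List<String> cleanse(List<String> records) {
        return records.stream()
            .filter(r -> !r.isBlank())          // filter: drop blank records
            .map(r -> r.trim().toLowerCase())   // function: normalize each record
            .collect(Collectors.toList());
    }

    // Merge (union) of two inputs, followed by Unique (distinct).
    static List<String> mergeUnique(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream())
            .distinct()
            .collect(Collectors.toList());
    }

    // Inner join on key: only keys present on both sides survive.
    static Map<String, String> innerJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String> joined = new TreeMap<>();
        left.forEach((key, value) -> {
            if (right.containsKey(key)) {
                joined.put(key, value + "," + right.get(key));
            }
        });
        return joined;
    }

    public static void main(String[] args) {
        System.out.println(cleanse(List.of(" Rain ", "", "Shadow")));
        System.out.println(mergeUnique(List.of("a", "b"), List.of("b", "c")));
        System.out.println(innerJoin(Map.of("1", "alice", "2", "bob"), Map.of("2", "sales")));
    }
}
```

An outer join would differ only in keeping keys that lack a match on one side. In Cascading these operations are expressed as pipe assemblies rather than in-memory collections.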
  • 13. word count – Cascading Java API
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow definition
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // create the Flow
      Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
      wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
      wcFlow.complete(); // <<-- Runs jobs on Cluster

      [Slide callouts 1, 2, 3 mark the parts of the listing; labels: configuration, integration, processing, scheduling]
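To check what the word count should produce on a small sample without a cluster, the same split, group, and count logic can be approximated in plain Java. This is a sketch under stated assumptions, not the Cascading API: the class name `LocalWordCount` is hypothetical, the delimiter set (space, square brackets, parentheses, comma, period) follows the slide, and dropping empty tokens is a simplification chosen here that the slide's flow does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Delimiters: space, square brackets, parentheses, comma, period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split every line into tokens, drop empty tokens (an assumption of
    // this sketch), then group identical tokens and count each group.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(DELIMITERS.split(line)))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the deck's data set.
        System.out.println(count(List.of("rain (shadow), rain.")));
    }
}
```

The flatMap step plays the role of the `Each` with a split generator, and `groupingBy` with `counting` plays the role of `GroupBy` followed by `Every` with `Count`.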
  • 14. wc.dot [Flow graph of the word count, labelled "mapreduce"; nodes listed from head to tail]
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
      Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
      GroupBy('wc')[by:['token']]
      Every('wc')[Count[decl:'count']]
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
      [tail]
      [Edge labels: [{2}:'doc_id', 'text'], [{1}:'token'], wc[{1}:'token'], [{2}:'token', 'count']]
  • 15. A REAL WORLD APP • 1 App, 1 Flow, 75 Steps/MR jobs • A graph of jobs, not operations! [Diagram: 75 job nodes labelled [1/75] to [75/75]; steps 55, 57 to 62, 71, and 72 are map only, all others are map+reduce. Legend: green = map + reduce, purple = map, blue = join/merge, orange = map split]
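A flow like this is a dependency graph of jobs: a job can start only after the jobs feeding it have finished. The following plain-Java sketch shows one generic way to derive a valid run order from such a graph, using Kahn's topological sort. It is an illustration only, not Cascading's scheduler; the class name `JobGraphSketch` and the four-job example graph are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobGraphSketch {
    // Kahn's algorithm: returns the jobs in an order where every job
    // appears after all of the jobs it depends on.
    static List<Integer> runOrder(int jobCount, int[][] edges) {
        Map<Integer, List<Integer>> downstream = new HashMap<>();
        int[] pending = new int[jobCount];   // unfinished upstream jobs per job
        for (int[] edge : edges) {           // edge = {upstream, downstream}
            downstream.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
            pending[edge[1]]++;
        }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int job = 0; job < jobCount; job++) {
            if (pending[job] == 0) {
                ready.add(job);              // no dependencies: can start at once
            }
        }
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int job = ready.remove();
            order.add(job);
            for (int next : downstream.getOrDefault(job, List.of())) {
                if (--pending[next] == 0) {
                    ready.add(next);         // all upstream jobs are done
                }
            }
        }
        if (order.size() != jobCount) {
            throw new IllegalStateException("cycle in job graph");
        }
        return order;
    }

    public static void main(String[] args) {
        // Jobs 0 and 1 both feed job 2 (a join), which feeds job 3.
        System.out.println(runOrder(4, new int[][] {{0, 2}, {1, 2}, {2, 3}}));
    }
}
```

Jobs that sit in the ready queue at the same time have no dependency on each other, so a scheduler is free to run them in parallel.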
  • 16. It’s not just for Java
  • 17. word count – Scalding (Scala)
      // Sujit Pal
      // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

      package com.mycompany.impatient

      import com.twitter.scalding._

      class Part2(args : Args) extends Job(args) {
        val input = Tsv(args("input"), ('docId, 'text))
        val output = Tsv(args("output"))
        input.read.
          flatMap('text -> 'word) {
            text : String => text.split("""\s+""")
          }.
          groupBy('word) { group => group.size }.
          write(output)
      }
  • 18. word count – Cascalog (Clojure)
      ; Paul Lam
      ; github.com/Quantisan/Impatient

      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\](),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
  • 19. “FOR THE IMPATIENT” SERIES • Step-by-step tutorials on Cascading on GitHub • Community has ported them to Scalding and Cascalog • http://docs.cascading.org/impatient/
  • 20. WHY CASCADING? • Foundation of patterns and best practices for building Languages, Frameworks, and Applications • Designed to abstract Hadoop away from the business logic • Models other than MapReduce are on the way!
  • 21. LINGUAL • ANSI Compatible SQL • JDBC Driver • Cascading Java API • SQL Command Shell • Catalog Manager Tool • Data Provider API [Diagram: Query Planner, JDBC API, Lingual API, Provider API, Cascading, Apache Hadoop, Lingual, Data Stores, CLI / Shell, Enterprise Java, Catalog]
  • 22. Cascading API
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "sqlflow" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

      SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );

      flowDef.addAssemblyPlanner( sqlPlanner );
  • 23. JDBC driver
      public void run() throws ClassNotFoundException, SQLException {
        Class.forName( "cascading.lingual.jdbc.Driver" );
        Connection connection =
          DriverManager.getConnection(
            "jdbc:lingual:local;schemas=src/main/resources/data/example" );
        Statement statement = connection.createStatement();

        ResultSet resultSet = statement.executeQuery(
          "select *\n"
            + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
            + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
            + "on e.\"EMPID\" = s.\"CUST_ID\"" );

        // do something

        resultSet.close();
        statement.close();
        connection.close();
      }
  • 24. SHELL - !TABLES
  • 25. 
      # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver",
        "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv,
        "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection,
        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 26. > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92
  • 27. “But we use a custom data format”
  • 28. DATA PROVIDER API • Any Cascading Tap and/or Scheme can be used from JDBC • Use a “fat jar” on local disk or from a Maven repo ‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0 • The jar is dynamically loaded into the cluster
  • 29. [Diagram: Amazon Elastic MapReduce runs a series of Jobs for the query SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ... Data stores shown: Amazon S3 and Amazon RedShift. Datasets shown: file1, file2, results]
  • 30. WHY LINGUAL • Quickly migrate existing workloads from RDBMS to Hadoop • Quickly extract data from Hadoop into applications
  • 31. PATTERN • Predictive model scoring • Java API and PMML parser • Supports: ‣ (General) Regression ‣ Clustering ‣ Decision Trees ‣ Random Forest ‣ and ensembles of models [Diagram: PMML Parser, Pattern API, Cascading, Apache Hadoop, Pattern, Data Stores, Enterprise Java]
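"Scoring" here means applying an already trained model to new records. As a rough illustration of what scoring a linear regression model involves, the following plain-Java sketch computes the intercept plus the weighted sum of a record's fields. It does not use the Pattern API and does not parse PMML; the class name `RegressionScoreSketch`, the field names, and the coefficients are all hypothetical.

```java
import java.util.Map;

public class RegressionScoreSketch {
    // Score one record: intercept plus the sum of coefficient * field value.
    static double score(double intercept,
                        Map<String, Double> coefficients,
                        Map<String, Double> record) {
        double result = intercept;
        for (Map.Entry<String, Double> term : coefficients.entrySet()) {
            Double value = record.get(term.getKey());
            if (value == null) {
                throw new IllegalArgumentException("missing field: " + term.getKey());
            }
            result += term.getValue() * value;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical model: y = 1.0 + 2.0 * x1 - 1.0 * x2
        Map<String, Double> coefficients = Map.of("x1", 2.0, "x2", -1.0);
        System.out.println(score(1.0, coefficients, Map.of("x1", 3.0, "x2", 4.0)));
    }
}
```

In a PMML workflow the intercept and coefficients would come from the exported model file rather than being written in code, which is what keeps the model independent of the data and integration layers.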
  • 32. 
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

      flowDef.addAssemblyPlanner( pmmlPlanner );
  • 33. WHY PATTERN • Standards compliance provides integration with many tools • Models are independent of data and integration • Debug only Cascading, not an ensemble of applications
  • 34. CLOSING THE LOOP [Diagram: a Desktop and a Cluster connected through Pattern. Steps: import data, create models, export models, execute models, import results, test results. Elements shown: JDBC Flow, PMML Flow, PMML, Job, DATA]
  • 35. DRIVEN • Understand how your application maps onto your cluster • Identify bottlenecks (data, code, or the system) • Jump to the line of code implicated in a failure • Plugin available via Maven repo • Beta UI hosted online • http://cascading.io/driven/
  • 36. MANAGED WITH DRIVEN
  • 37. [No text on slide]
  • 38. CASCADING 3.0 • New query planner ‣ User-definable Assertion and Transformation rules ‣ Sub-graph isomorphism pattern matching ‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75 • Hadoop Tez support • And likely other platforms
  • 39. THERE’S A BOOK! Enterprise Data Workflows with Cascading - Paco Nathan, O’Reilly, 2013. amazon.com/dp/1449358721
  • 40. CONTACT @cwensel | @cascading chris@wensel.net www.cascading.org www.concurrentinc.com