Hadoop Summit EU 2014

    Hadoop Summit EU 2014 Presentation Transcript

    • DATA INTEGRATION AND SQL APPLICATION MIGRATION WITH CASCADING LINGUAL. Chris K Wensel | Hadoop Summit EU 2014
    • CHRIS K WENSEL
        • Not a “data scientist”
        • No idea what “big data” means
        • Used MR in anger once, and did it wrong
        • Author of Cascading
        • Co-Author of Lingual (w/ Julian Hyde)
    • Why is Hadoop & “big data” a thing?
    • HADOOP & BIG DATA: More is better
    • HADOOP & BIG DATA: More Data, More Machines, More Algorithms, More Tools
    • HADOOP & BIG DATA: Worse is better
    • HADOOP & BIG DATA: Less red tape, more degrees of freedom, no upfront design
    • Why Cascading?
    • CASCADING: Makes hard things possible.
    • CASCADING: While helping to retain Conceptual Integrity.
    • HADOOP & BIG DATA: "the speed of innovation is proportional to the arrival rate of answers to questions"
    • CASCADING: True when you are questioning Data, Algorithms, and Architecture
    • CASCADING
        • Java API (alternative to Hadoop MapReduce)
        • Separates business logic from integration
        • Testable at every lifecycle stage
        • Works with any JVM language
        • Many integration adapters
        [Diagram labels: Process Planner; Processing API; Integration API; Scheduler API; Scheduler; Compute; Cascading; Data Stores; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Enterprise Java]
    • ECOSYSTEM
        [Diagram labels: Lingual; Pattern; Cascading; Hadoop MR; Scalding; Cascalog; Hadoop Tez; Whatever]
    • CASCADING
        • Started in 2007
        • 2.0 released June 2012
        • 2.5 stable out now
        • 3.0 wip now available
        • Tez support coming soon
        • Apache Licensed Open-Source
        • Supports all Hadoop 1 & 2 distros
    • LINGUAL: ANSI SQL on Cascading on Whatever
    • LINGUAL: How’s this different than all the other “SQL for Hadoop” projects?
    • WHY LINGUAL? Not intended as an ad-hoc query interface. [Lingual is only as fast as Hadoop]
    • WHY LINGUAL? Is intended to be as standards compliant as possible.
    • WHY LINGUAL? Migrate workloads from expensive systems to less expensive Hadoop
    • WHY LINGUAL? Liberate the data trapped on Hadoop w/o involving an Engineer
    • LINGUAL
        • ANSI Compatible SQL
        • JDBC Driver
        • Cascading Java API
        • SQL Command Shell
        • Catalog Manager Tool
        • Data Provider API
        [Diagram labels: Query Planner; JDBC API; Lingual API; Provider API; Cascading; Compute; Lingual; Data Stores; CLI / Shell; Enterprise Java; Catalog]
    • ANSI SQL
        • SQL-92
        • Character, Numeric, and Temporal types
        • IN and CASE
        • FROM sub-queries
        • CAST and CONVERT
        • CURRENT_*
        http://docs.cascading.org/lingual/1.1/#sql-support
    • Lingual 1.1 -> Optiq 0.4.12.3 https://github.com/julianhyde/optiq/blob/master/REFERENCE.md

        query:
            {
                select
            |   query UNION [ ALL ] query
            |   query EXCEPT query
            |   query INTERSECT query
            }
            [ ORDER BY orderItem [, orderItem ]* ]
            [ LIMIT { count | ALL } ]
            [ OFFSET start { ROW | ROWS } ]
            [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

        orderItem:
            expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

        select:
            SELECT [ ALL | DISTINCT ]
                { * | projectItem [, projectItem ]* }
            FROM tableExpression
            [ WHERE booleanExpression ]
            [ GROUP BY { () | expression [, expression]* } ]
            [ HAVING booleanExpression ]
            [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

        projectItem:
                expression [ [ AS ] columnAlias ]
            |   tableAlias . *

        tableExpression:
                tableReference [, tableReference ]*
            |   tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ]
                    JOIN tableExpression [ joinCondition ]

        joinCondition:
                ON booleanExpression
            |   USING ( column [, column ]* )

        tableReference:
            tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

        tablePrimary:
                [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
            |   ( query )
            |   VALUES expression [, expression ]*
            |   ( TABLE expression )

        windowRef:
                windowName
            |   windowSpec

        windowSpec:
            [ windowName ]
            (
                [ ORDER BY orderItem [, orderItem ]* ]
                [ PARTITION BY expression [, expression ]* ]
                {
                    RANGE numericOrInterval { PRECEDING | FOLLOWING }
                |   ROWS numeric { PRECEDING | FOLLOWING }
                }
            )
    • APIS: Lingual provides two interfaces.
    • CASCADING API: Allows SQL and non-SQL Flows to work together as a single application via conceptually similar interfaces
    • Cascading API

        FlowDef flowDef = FlowDef.flowDef()
          .setName( "sqlflow" )
          .addSource( "example.employee", emplTap )
          .addSource( "example.sales", salesTap )
          .addSink( "results", resultsTap );

        SQLPlanner sqlPlanner = new SQLPlanner()
          .setSql( sqlStatement );

        flowDef.addAssemblyPlanner( sqlPlanner );

        Flow flow = new HadoopFlowConnector().connect( flowDef );

        flow.complete();
    • JDBC API: So Systems and People can talk directly to Hadoop-visible data
    • JDBC driver

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;

        public void run() throws ClassNotFoundException, SQLException {
          // register the Lingual JDBC driver
          Class.forName( "cascading.lingual.jdbc.Driver" );
          Connection connection =
            DriverManager.getConnection(
              "jdbc:lingual:local;schemas=src/main/resources/data/example" );
          Statement statement = connection.createStatement();

          // quoted identifiers must be escaped inside the Java string
          ResultSet resultSet = statement.executeQuery(
            "select *\n"
              + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
              + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
              + "on e.\"EMPID\" = s.\"CUST_ID\"" );

          // do something

          resultSet.close();
          statement.close();
          connection.close();
        }
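The connection string on the JDBC slide packs the platform and its properties into a single URL. A minimal sketch of composing such a URL, assuming only the `jdbc:lingual:<platform>;key=value` shape that appears in this deck. The class `LingualUrls` and its `url` method are hypothetical helpers for illustration and are not part of Lingual.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LingualUrls {
    // Hypothetical helper: joins the platform and properties in the
    // "jdbc:lingual:<platform>;key=value;..." shape seen on the slides.
    static String url(String platform, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder("jdbc:lingual:").append(platform);
        for (Map.Entry<String, String> e : properties.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("schemas", "src/main/resources/data/example");
        // prints: jdbc:lingual:local;schemas=src/main/resources/data/example
        System.out.println(url("local", props));
    }
}
```

The resulting string is what would be handed to `DriverManager.getConnection`, as on the slide.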
    • JDBC
        [Diagram labels: Server / Desktop; JDBC; Flow; Assembly; Cluster; Job; SQL "select * from employees ..."; lingual-hadoop-1.1.0-jdbc.jar; meta-data catalog]
    • DEFAULT SHELL
    • SUB-QUERY

        select dept_no, avg( max_salary )
        from employees.dept_emp,
          ( select emp_no as sal_emp_no, max( salary ) as max_salary
            from employees.salaries
            group by emp_no )
        where dept_emp.emp_no = sal_emp_no
        group by dept_no;
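The sub-query slide has a two-step shape: the inner query reduces salaries to each employee's maximum, and the outer query joins on employee number and averages those maxima per department. A plain-Java sketch of the same computation over in-memory rows follows. The sample rows and the `SubQuerySketch` class are made up for illustration and say nothing about how Lingual plans or runs the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SubQuerySketch {
    // Illustrative row types; field names mirror the columns in the query.
    static final class Salary {
        final int empNo; final int salary;
        Salary(int empNo, int salary) { this.empNo = empNo; this.salary = salary; }
    }
    static final class DeptEmp {
        final int empNo; final String deptNo;
        DeptEmp(int empNo, String deptNo) { this.empNo = empNo; this.deptNo = deptNo; }
    }

    static Map<String, Double> avgMaxSalaryByDept(List<Salary> salaries, List<DeptEmp> deptEmps) {
        // Inner query: max( salary ) ... group by emp_no
        Map<Integer, Integer> maxByEmp = salaries.stream()
            .collect(Collectors.toMap(s -> s.empNo, s -> s.salary, Integer::max));
        // Outer query: join on emp_no, then avg( max_salary ) ... group by dept_no
        return deptEmps.stream()
            .filter(d -> maxByEmp.containsKey(d.empNo))
            .collect(Collectors.groupingBy(d -> d.deptNo, TreeMap::new,
                Collectors.averagingInt(d -> maxByEmp.get(d.empNo))));
    }

    // Made-up sample rows, not taken from the employees database.
    static List<Salary> sampleSalaries() {
        return Arrays.asList(new Salary(1, 50000), new Salary(1, 60000),
            new Salary(2, 40000), new Salary(3, 70000), new Salary(3, 80000));
    }
    static List<DeptEmp> sampleDeptEmps() {
        return Arrays.asList(new DeptEmp(1, "d001"), new DeptEmp(2, "d001"), new DeptEmp(3, "d002"));
    }

    public static void main(String[] args) {
        // prints: {d001=50000.0, d002=80000.0}
        System.out.println(avgMaxSalaryByDept(sampleSalaries(), sampleDeptEmps()));
    }
}
```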
    • ACCESS HADOOP FROM R

        # load the JDBC package
        library(RJDBC)

        # set up the driver
        drv <- JDBC("cascading.lingual.jdbc.Driver",
          "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

        # set up a database connection to a local repository
        connection <- dbConnect(drv,
          "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

        # query the repository: in this case the MySQL sample database (CSV files)
        df <- dbGetQuery(connection,
          "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
        head(df)

        # use R functions to summarize and visualize part of the data
        df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
        summary(df$hire_age)

        library(ggplot2)
        m <- ggplot(df, aes(x=hire_age))
        m <- m + ggtitle("Age at hire, people named Gina")
        m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
    • RESULTS

        > summary(df$hire_age)
           Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          20.86   27.89   31.70   31.61   35.01   43.92
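The `hire_age` column summarized above is the number of days between birth date and hire date, divided by 365.25. For readers coming from the Java side of the deck, a rough equivalent of that arithmetic is sketched below. The `HireAge` class and the dates in `main` are made up for illustration and are not rows from the sample database.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class HireAge {
    // Same arithmetic as the R line: whole days between the two dates,
    // divided by 365.25 to approximate years.
    static double hireAge(LocalDate birthDate, LocalDate hireDate) {
        return ChronoUnit.DAYS.between(birthDate, hireDate) / 365.25;
    }

    public static void main(String[] args) {
        // Hypothetical dates chosen only to exercise the method.
        System.out.println(hireAge(LocalDate.of(1960, 1, 1), LocalDate.of(1990, 1, 1)));
    }
}
```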
    • INTEGRATION: But I use a custom data format!
    • DATA PROVIDER API
        • Any Cascading Tap and/or Scheme can be used from JDBC
        • Use a “fat jar” on local disk or from a Maven repo
            ‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
        • The Jar is dynamically loaded into cluster, on the fly
    • DATA PROVIDER
        [Diagram labels: JDBC; Maven Repo; Assembly; Flow; Cluster; Job; lingual-hadoop-1.1.0-jdbc.jar; cascading-jdbc-oracle-provider.jar; your-avro-provider.jar]
    • AMAZON EMR & REDSHIFT
        [Diagram labels: Amazon Elastic MapReduce; Job; SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...; Amazon S3; Amazon RedShift; file1; file2; results]
        http://docs.cascading.org/tutorials/lingual-redshift/
    • MANAGED: All Cascading applications can be visualized and monitored …
    • DRIVEN
        • Understand how your application maps onto your cluster
        • Identify bottlenecks (data, code, or the system)
        • Jump to the line of code implicated on a failure
        • Plugin available via Maven repo
        • Beta UI hosted online
        http://cascading.io/driven/
    • MANAGED WITH DRIVEN
    • A BOOK! Enterprise Data Workflows with Cascading, O’Reilly, 2013. amazon.com/dp/1449358721
    • DONE: chris@concurrentinc.com, @cwensel