Hadoop Summit EU 2014

721 views
611 views

Published on

0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
721
On SlideShare
0
From Embeds
0
Number of Embeds
7
Actions
Shares
0
Downloads
11
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Hadoop Summit EU 2014

  1. 1. DATAINTEGRATIONAND SQLAPPLICATIONMIGRATION WITH CASCADINGLINGUAL Chris K Wensel | Hadoop Summit EU 2014
  2. 2. • Not a “data scientist” • No idea what “big data” means • Used MR in anger once, and did it wrong • Author of Cascading • Co-Author of Lingual (w/ Julian Hyde) CHRISKWENSEL 2
  3. 3. 3 Why is Hadoop & “big data” a thing?
  4. 4. More is better HADOOP&BIGDATA 4
  5. 5. More Data More Machines More Algorithms More Tools HADOOP&BIGDATA 5
  6. 6. Worse is better HADOOP&BIGDATA 6
  7. 7. Less red tape More degrees of freedom No upfront design HADOOP&BIGDATA 7
  8. 8. 8 Why Cascading?
  9. 9. Makes hard things possible. CASCADING 9
  10. 10. While helping to retain Conceptual Integrity. CASCADING 10
  11. 11. "the speed of innovation is proportional to the arrival rate of answers to questions" HADOOP&BIGDATA 11
  12. 12. True when you are questioning Data, Algorithms, and Architecture CASCADING 12
  13. 13. • Java API (alternative to Hadoop MapReduce) • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters CASCADING 13 Process Planner Processing API Integration API Scheduler API Scheduler Compute Cascading Data Stores Scripting Scala, Clojure, JRuby, Jython, Groovy Enterprise Java
  14. 14. ECOSYSTEM 14 Lingual Pattern Cascading Hadoop MR Scalding Cascalog Hadoop Tez Whatever
  15. 15. • Started in 2007 • 2.0 released June 2012 • 2.5 stable out now • 3.0 wip now available • Tez support coming soon • Apache Licensed Open-Source • Supports all Hadoop 1 & 2 distros CASCADING 15
  16. 16. ANSI SQL on Cascading on Whatever LINGUAL 16
  17. 17. How’s this different than all the other “SQL for Hadoop” projects? LINGUAL 17
  18. 18. Not intended as an ad-hoc query interface. [Lingual is only as fast as Hadoop] WHYLINGUAL? 18
  19. 19. Is intended to be as standards compliant as possible. WHYLINGUAL? 19
  20. 20. Migrate workloads from expensive systems to less expensive Hadoop WHYLINGUAL? 20
  21. 21. Liberate the data trapped on Hadoop w/o involving an Engineer WHYLINGUAL? 21
  22. 22. • ANSI Compatible SQL • JDBC Driver • Cascading Java API • SQL Command Shell • Catalog Manager Tool • Data Provider API LINGUAL 22 Query Planner JDBC API Lingual APIProvider API Cascading Compute Lingual Data Stores CLI / Shell Enterprise Java Catalog
  23. 23. • SQL-92 • Character, Numeric, and Temporal types • IN and CASE • FROM sub-queries • CAST and CONVERT • CURRENT_* ANSISQL 23 http://docs.cascading.org/lingual/1.1/#sql-support
  24. 24. 24 query:      {              select      |      query  UNION  [  ALL  ]  query      |      query  EXCEPT  query      |      query  INTERSECT  query      }      [  ORDER  BY  orderItem  [,  orderItem  ]*  ]      [  LIMIT  {  count  |  ALL  }  ]      [  OFFSET  start  {  ROW  |  ROWS  }  ]      [  FETCH  {  FIRST  |  NEXT  }  [  count  ]  {  ROW  |  ROWS  }  ]   ! orderItem:      expression  [  ASC  |  DESC  ]  [  NULLS  FIRST  |  NULLS  LAST  ]   ! select:      SELECT  [  ALL  |  DISTINCT  ]              {  *  |  projectItem  [,  projectItem  ]*  }      FROM  tableExpression      [  WHERE  booleanExpression  ]      [  GROUP  BY  {  ()  |  expression  [,  expression]*  }  ]      [  HAVING  booleanExpression  ]      [  WINDOW  windowName  AS  windowSpec  [,  windowName  AS  windowSpec  ]*  ]   ! projectItem:              expression  [  [  AS  ]  columnAlias  ]      |      tableAlias  .  *   tableExpression:              tableReference  [,  tableReference  ]*      |      tableExpression  [  NATURAL  ]  [  LEFT  |  RIGHT  |  FULL  ]                    JOIN  tableExpression  [  joinCondition  ]   ! joinCondition:              ON  booleanExpression      |      USING  (  column  [,  column  ]*  )   ! tableReference:      tablePrimary  [  [  AS  ]  alias  [  (  columnAlias  [,  columnAlias  ]*  )  ]  ]   ! tablePrimary:              [  TABLE  ]  [  [  catalogName  .  ]  schemaName  .  ]  tableName      |      (  query  )      |      VALUES  expression  [,  expression  ]*      |      (  TABLE  expression  )   ! windowRef:              windowName      |      windowSpec   ! windowSpec:      [  windowName  ]      (              [  ORDER  BY  orderItem  [,  orderItem  ]*  ]              [  PARTITION  BY  expression  [,  expression  ]*  ]              {                      RANGE  numericOrInterval  {  PRECEDING  |  FOLLOWING  }              |                      ROWS  numeric  {  PRECEDING  |  FOLLOWING  }              }      ) Lingual 1.1 -> Optiq 0.4.12.3 https://github.com/julianhyde/optiq/blob/master/REFERENCE.md
  25. 25. Lingual provides two interfaces. APIS 25
  26. 26. Allows SQL and non-SQL Flows to work together as a single application via conceptually similar interfaces CASCADINGAPI 26
  27. 27. 27 Cascading API ! FlowDef flowDef = FlowDef.flowDef()! .setName( "sqlflow" )! .addSource( "example.employee", emplTap )! .addSource( "example.sales", salesTap )! .addSink( "results", resultsTap );!  ! SQLPlanner sqlPlanner = new SQLPlanner()! .setSql( sqlStatement );!  ! flowDef.addAssemblyPlanner( sqlPlanner );! ! Flow  flow  =  new  HadoopFlowConnector().connect(  flowDef  );   ! flow.complete();
  28. 28. So Systems and People can talk directly to Hadoop visible data JDBCAPI 28
  29. 29. 29 JDBC driver public void run() throws ClassNotFoundException, SQLException {! Class.forName( "cascading.lingual.jdbc.Driver" );! Connection connection =! DriverManager.getConnection(! "jdbc:lingual:local;schemas=src/main/resources/data/example" );! Statement statement = connection.createStatement();!  ! ResultSet resultSet = statement.executeQuery(! "select *n"! + "from "EXAMPLE"."SALES_FACT_1997" as sn"! + "join "EXAMPLE"."EMPLOYEE" as en"! + "on e."EMPID" = s."CUST_ID"" );!  ! // do something!  ! resultSet.close();! statement.close();! connection.close();! }
  30. 30. JDBC 30 Server / Desktop JDBC FlowAssembly Cluster JobJobSQL select * from employees ... SQL select * from employees ... SQL select * from employees ... lingual-hadoop-1.1.0-jdbc.jar meta-data catalog
  31. 31. DEFAULTSHELL 31
  32. 32. select dept_no, avg( max_salary ) from employees.dept_emp, ( select emp_no as sal_emp_no, max( salary ) as max_salary from employees.salaries group by emp_no ) where dept_emp.emp_no = sal_emp_no group by dept_no; SUB-QUERY 32
  33. 33. ACCESSHADOOPFROMR 33 # load the JDBC package! library(RJDBC)!  ! # set up the driver! drv <- JDBC("cascading.lingual.jdbc.Driver", ! "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")!  ! # set up a database connection to a local repository! connection <- dbConnect(drv, ! "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")!  ! # query the repository: in this case the MySQL sample database (CSV files)! df <- dbGetQuery(connection, ! "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")! head(df)!  ! # use R functions to summarize and visualize part of the data! df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25! summary(df$hire_age)! ! library(ggplot2)! m <- ggplot(df, aes(x=hire_age))! m <- m + ggtitle("Age at hire, people named Gina")! m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  34. 34. RESULTS 34 > summary(df$hire_age) Min. 1st Qu. Median Mean 3rd Qu. Max. 20.86 27.89 31.70 31.61 35.01 43.92
  35. 35. INTEGRATION 35 But I use a custom data format!
  36. 36. • Any Cascading Tap and/or Scheme can be used from JDBC • Use a “fat jar” on local disk or from a Maven repo ‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0 • The Jar is dynamically loaded into cluster, on the fly DATAPROVIDERAPI 36
  37. 37. DATAPROVIDER 37 JDBC Maven Repo Assembly Flow Cluster JobJob lingual-hadoop-1.1.0-jdbc.jar cascading-jdbc-oracle-provider.jar your-avro-provider.jar
  38. 38. AMAZONEMR&REDSHIFT 38 Amazon Elastic MapReduce Job Job Job Job SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ... Amazon S3 Amazon RedShift file1 file2 results http://docs.cascading.org/tutorials/lingual-redshift/
  39. 39. All Cascading applications can be visualized and monitored … MANAGED 39
  40. 40. • Understand how your application maps onto your cluster • Identify bottlenecks (data, code, or the system) • Jump to the line of code implicated on a failure • Plugin available via Maven repo • Beta UI hosted online DRIVEN 40 http://cascading.io/driven/
  41. 41. MANAGEDWITHDRIVEN 41
  42. 42. 42
  43. 43. ABOOK! 43 Enterprise DataWorkflows
 with Cascading O’Reilly, 2013 amazon.com/dp/1449358721
  44. 44. chris@concurrentinc.com ! ! @cwensel DONE 44

×