Hadoop Summit EU 2014

DATA INTEGRATION AND
SQL APPLICATION MIGRATION
WITH
CASCADING LINGUAL
Chris K Wensel | Hadoop Summit EU 2014

• Not a “data scientist”
• No idea what “big data” means
• Used MR in anger once, and did it wrong
• Author of Cascading
• Co-Author of Lingual (w/ Julian Hyde)
CHRIS K WENSEL
2

3
Why is Hadoop & “big data” a thing?

More is better
HADOOP & BIG DATA
4

More Data
More Machines
More Algorithms
More Tools
HADOOP & BIG DATA
5

Worse is better
HADOOP & BIG DATA
6

Less red tape
More degrees of freedom
No upfront design
HADOOP & BIG DATA
7

Makes hard things possible.
CASCADING
9

While helping to retain
Conceptual Integrity.
CASCADING
10

"the speed of innovation is
proportional to the arrival rate of
answers to questions"
HADOOP & BIG DATA
11

True when you are questioning
Data, Algorithms, and
Architecture
CASCADING
12

• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
13
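[Diagram: Cascading architecture. Scripting languages (Scala, Clojure, JRuby, Jython, Groovy) and Enterprise Java sit on top of Cascading, which exposes a Processing API, an Integration API, and a Scheduler API around a Process Planner, and connects to Compute, a Scheduler, and Data Stores.]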

ECOSYSTEM
14
[Diagram: Lingual, Pattern, Scalding, and Cascalog sit on top of Cascading, which sits on top of Hadoop MR, Hadoop Tez, or "Whatever".]

• Started in 2007
• 2.0 released June 2012
• 2.5 stable out now
• 3.0 WIP now available
• Tez support coming soon
• Apache Licensed Open-Source
• Supports all Hadoop 1 & 2 distros
CASCADING
15

ANSI SQL
on Cascading
on Whatever
LINGUAL
16

How’s this different than all the
other “SQL for Hadoop” projects?
LINGUAL
17

Not intended as
an ad-hoc query interface.
[Lingual is only as fast as Hadoop]
WHY LINGUAL?
18

Is intended to be
as standards compliant as
possible.
WHY LINGUAL?
19

Migrate workloads from expensive systems
to less expensive Hadoop
WHY LINGUAL?
20

Liberate the data trapped on Hadoop w/o
involving an Engineer
WHY LINGUAL?
21

• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
22
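[Diagram: Lingual architecture. The CLI / Shell and Enterprise Java sit on top of Lingual, which exposes a JDBC API, a Lingual API, and a Provider API around a Query Planner and a Catalog, and runs on Cascading over Compute and Data Stores.]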

• SQL-92
• Character, Numeric, and Temporal types
• IN and CASE
• FROM sub-queries
• CAST and CONVERT
• CURRENT_*
ANSI SQL
23
http://docs.cascading.org/lingual/1.1/#sql-support

24
query:
      {   select
      |   query UNION [ ALL ] query
      |   query EXCEPT query
      |   query INTERSECT query
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]
      [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
      expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
      SELECT [ ALL | DISTINCT ]
          { * | projectItem [, projectItem ]* }
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { () | expression [, expression ]* } ]
      [ HAVING booleanExpression ]
      [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
      expression [ [ AS ] columnAlias ]
  |   tableAlias . *

tableExpression:
      tableReference [, tableReference ]*
  |   tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]

joinCondition:
      ON booleanExpression
  |   USING ( column [, column ]* )

tableReference:
      tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
      [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
  |   ( query )
  |   VALUES expression [, expression ]*
  |   ( TABLE expression )

windowRef:
      windowName
  |   windowSpec

windowSpec:
      [ windowName ]
      (
      [ ORDER BY orderItem [, orderItem ]* ]
      [ PARTITION BY expression [, expression ]* ]
      {
          RANGE numericOrInterval { PRECEDING | FOLLOWING }
      |   ROWS numeric { PRECEDING | FOLLOWING }
      }
      )

Lingual 1.1 -> Optiq 0.4.12.3
https://github.com/julianhyde/optiq/blob/master/REFERENCE.md

Lingual provides two interfaces.
APIS
25

Allows SQL and non-SQL Flows to work
together as a single application via
conceptually similar interfaces
CASCADING API
26

27
Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

Flow flow = new HadoopFlowConnector().connect( flowDef );

flow.complete();

So Systems and People can talk directly to
Hadoop visible data
JDBC API
28

29
JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  // register the Lingual JDBC driver
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example"
    );
  Statement statement = connection.createStatement();

  // quoted identifiers must be escaped inside the Java string literal
  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
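
The connection string above has the shape jdbc:lingual:<platform> followed by ;key=value pairs (schemas here; catalog and schema appear in the R example). A minimal sketch that assembles such a string, assuming only that shape as it appears in these slides. The class LingualUrlSketch and its build method are hypothetical helpers, not part of Lingual, and the driver documentation remains the authority on which properties exist.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lingual: builds a JDBC URL of the
// form jdbc:lingual:<platform>;key=value;key=value
final class LingualUrlSketch {

  static String build(String platform, Map<String, String> properties) {
    // each property becomes a ";key=value" suffix, in insertion order
    String suffix = properties.entrySet().stream()
        .map(e -> ";" + e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining());
    return "jdbc:lingual:" + platform + suffix;
  }

  public static void main(String[] args) {
    Map<String, String> props = new LinkedHashMap<>();
    props.put("schemas", "src/main/resources/data/example");
    System.out.println(build("local", props));
  }
}
```

The resulting string is what would be handed to DriverManager.getConnection(...) in the slide's example.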

JDBC
30
[Diagram: a server or desktop application submits SQL (select * from employees ...) through the JDBC driver, lingual-hadoop-1.1.0-jdbc.jar, which uses the meta-data catalog; the SQL becomes a Cascading assembly and flow that runs as jobs on the cluster.]

select dept_no, avg( max_salary ) from employees.dept_emp,
( select emp_no as sal_emp_no, max( salary ) as max_salary from employees.salaries
group by emp_no )
where dept_emp.emp_no = sal_emp_no group by dept_no;
SUB-QUERY
32
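
To make the nested query above concrete, here is a minimal plain-Java sketch of what it computes: the inner query takes the maximum salary per employee, and the outer query joins that to dept_emp and averages per department. The classes SubQuerySketch, Salary, and DeptEmp and any sample rows are hypothetical stand-ins for the employees tables. Lingual itself plans the SQL into Cascading flows rather than running code like this.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: in-memory equivalent of the sub-query slide.
final class SubQuerySketch {

  // stands in for a row of employees.salaries
  static final class Salary {
    final int empNo;
    final int salary;
    Salary(int empNo, int salary) { this.empNo = empNo; this.salary = salary; }
  }

  // stands in for a row of employees.dept_emp
  static final class DeptEmp {
    final int empNo;
    final String deptNo;
    DeptEmp(int empNo, String deptNo) { this.empNo = empNo; this.deptNo = deptNo; }
  }

  // inner query: select emp_no, max( salary ) ... group by emp_no
  static Map<Integer, Integer> maxSalaryByEmployee(List<Salary> salaries) {
    return salaries.stream()
        .collect(Collectors.toMap(s -> s.empNo, s -> s.salary, Integer::max));
  }

  // outer query: join on emp_no, then avg( max_salary ) group by dept_no
  static Map<String, Double> avgMaxSalaryByDept(List<DeptEmp> deptEmp, List<Salary> salaries) {
    Map<Integer, Integer> maxSalary = maxSalaryByEmployee(salaries);
    Map<String, Double> byDept = deptEmp.stream()
        // the WHERE clause is an inner join, so unmatched employees drop out
        .filter(d -> maxSalary.containsKey(d.empNo))
        .collect(Collectors.groupingBy(
            d -> d.deptNo,
            Collectors.averagingInt(d -> maxSalary.get(d.empNo))));
    return new TreeMap<>(byDept); // sorted by dept_no for stable output
  }
}
```

The two methods mirror the two levels of the query, which is the same decomposition a planner has to make before it can schedule the work as separate grouping steps.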

ACCESS HADOOP FROM R
33
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()

RESULTS
34
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92

INTEGRATION
35
But I use a custom data format!

• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster, on the fly
DATA PROVIDER API
36

DATA PROVIDER
37
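[Diagram: the JDBC driver (lingual-hadoop-1.1.0-jdbc.jar) together with provider jars such as cascading-jdbc-oracle-provider.jar and your-avro-provider.jar, obtained from a Maven repo; the resulting assembly and flow run as jobs on the cluster.]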

AMAZON EMR & REDSHIFT
38
[Diagram: Amazon Elastic MapReduce jobs run SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ... over file1 and file2, producing results; Amazon S3 and Amazon Redshift are the data stores.]
http://docs.cascading.org/tutorials/lingual-redshift/

All Cascading applications can be
visualized and monitored …
MANAGED
39

• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated on a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
40
http://cascading.io/driven/

A BOOK!
43
Enterprise Data Workflows
with Cascading

O’Reilly, 2013

amazon.com/dp/1449358721

chris@concurrentinc.com
@cwensel
DONE
44
