• Not a “data scientist”
• No idea what “big data” means
• Used MR in anger once, and did it wrong
• Author of Cascading
• Co-Author of Lingual (w/ Julian Hyde)
CHRIS K. WENSEL
"the speed of innovation is
proportional to the arrival rate of
answers to questions"
HADOOP & BIG DATA
True when you are questioning
Data, Algorithms, and
Architecture
CASCADING
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
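The "separates business logic from integration" point can be illustrated with a minimal plain-Java sketch. This is not Cascading's API: `Source`, `Sink`, `run`, and `upperCaseNonEmpty` are hypothetical names made up for illustration. The idea is only that logic written without knowledge of storage can be tested against in-memory data and later bound to real endpoints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class SeparationSketch {
    // Integration side (hypothetical interfaces): where records come from and where they go.
    interface Source { List<String> read(); }
    interface Sink { void write(List<String> records); }

    // Business logic: a pure transformation that knows nothing about storage.
    static List<String> upperCaseNonEmpty(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed.toUpperCase());
            }
        }
        return out;
    }

    // Wiring: binds the logic to whatever endpoints the caller supplies.
    static void run(Source source, Function<List<String>, List<String>> logic, Sink sink) {
        sink.write(logic.apply(source.read()));
    }

    public static void main(String[] args) {
        // In-memory endpoints stand in for real data stores during a test.
        List<String> captured = new ArrayList<>();
        run(() -> Arrays.asList("alpha", "  ", "beta"),
            SeparationSketch::upperCaseNonEmpty,
            captured::addAll);
        System.out.println(captured); // prints [ALPHA, BETA]
    }
}
```

Swapping the lambda and `captured::addAll` for file- or cluster-backed implementations would leave `upperCaseNonEmpty` untouched, which is the property the bullets describe.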
[Diagram: Cascading architecture. Labels: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Processing API, Integration API, Scheduler API; Process Planner; Cascading; Scheduler; Compute; Data Stores]
• Started in 2007
• 2.0 released June 2012
• 2.5 stable out now
• 3.0 WIP now available
• Tez support coming soon
• Apache Licensed Open-Source
• Supports all Hadoop 1 & 2 distros
CASCADING
Liberate the data trapped on Hadoop w/o
involving an Engineer
WHY LINGUAL?
• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
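A minimal sketch of using the JDBC driver from Java. The driver class name, URL shape, and query are taken from the R example later in this deck; the catalog path is a placeholder, and the `url` helper is a hypothetical convenience, not part of Lingual. It assumes the Lingual JDBC jar is on the classpath; without it, the program only reports the failure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class LingualJdbcSketch {
    // Driver class name as used in the R example in this deck.
    static final String DRIVER = "cascading.lingual.jdbc.Driver";

    // Hypothetical helper: builds a URL shaped like the one in the R example,
    // jdbc:lingual:<platform>;catalog=<path>;schema=<name>
    static String url(String platform, String catalogPath, String schema) {
        return "jdbc:lingual:" + platform + ";catalog=" + catalogPath + ";schema=" + schema;
    }

    public static void main(String[] args) {
        // Placeholder catalog location; pass a real one as the first argument.
        String catalog = args.length > 0 ? args[0] : "/path/to/lingual-examples/tables";
        String sql = "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'";
        try {
            // Throws ClassNotFoundException unless the Lingual JDBC jar is on the classpath.
            Class.forName(DRIVER);
            try (Connection connection = DriverManager.getConnection(url("local", catalog, "EMPLOYEES"));
                 Statement statement = connection.createStatement();
                 ResultSet rows = statement.executeQuery(sql)) {
                while (rows.next()) {
                    System.out.println(rows.getString("FIRST_NAME") + " " + rows.getString("HIRE_DATE"));
                }
            }
        } catch (ClassNotFoundException | SQLException e) {
            System.err.println("Could not query Lingual: " + e);
        }
    }
}
```

Everything after `Class.forName` is ordinary `java.sql` code, so existing JDBC tooling should be able to talk to the same URL.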
[Diagram: Lingual architecture. Labels: CLI / Shell; Enterprise Java; JDBC API, Lingual API, Provider API; Query Planner; Catalog; Lingual; Cascading; Compute; Data Stores]
• SQL-92
• Character, Numeric, and Temporal types
• IN and CASE
• FROM sub-queries
• CAST and CONVERT
• CURRENT_*
ANSI SQL
http://docs.cascading.org/lingual/1.1/#sql-support
query:
      {
          select
      |   query UNION [ ALL ] query
      |   query EXCEPT query
      |   query INTERSECT query
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]
      [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
      expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
      SELECT [ ALL | DISTINCT ]
          { * | projectItem [, projectItem ]* }
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { () | expression [, expression]* } ]
      [ HAVING booleanExpression ]
      [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
      expression [ [ AS ] columnAlias ]
  |   tableAlias . *

tableExpression:
      tableReference [, tableReference ]*
  |   tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]

joinCondition:
      ON booleanExpression
  |   USING ( column [, column ]* )

tableReference:
      tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
      [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
  |   ( query )
  |   VALUES expression [, expression ]*
  |   ( TABLE expression )

windowRef:
      windowName
  |   windowSpec

windowSpec:
      [ windowName ]
      (
      [ ORDER BY orderItem [, orderItem ]* ]
      [ PARTITION BY expression [, expression ]* ]
      {
          RANGE numericOrInterval { PRECEDING | FOLLOWING }
      |   ROWS numeric { PRECEDING | FOLLOWING }
      }
      )
Lingual 1.1 -> Optiq 0.4.12.3
https://github.com/julianhyde/optiq/blob/master/REFERENCE.md
select dept_no, avg( max_salary )
  from employees.dept_emp,
       ( select emp_no as sal_emp_no, max( salary ) as max_salary
           from employees.salaries
          group by emp_no )
 where dept_emp.emp_no = sal_emp_no
 group by dept_no;
SUB-QUERY
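To make the query's semantics concrete, here is the same computation sketched in plain Java over a few made-up rows: the inner query takes each employee's maximum salary, and the outer query joins on emp_no and averages those maxima per department. The sample data and the `avgMaxSalaryByDept` name are illustrative assumptions; this is not how Lingual plans or executes the query, and the numeric type an engine returns for avg may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class SubQuerySketch {
    // salaries rows are {emp_no, salary}; deptEmpNo[i] and deptNo[i] form one dept_emp row.
    static Map<String, Double> avgMaxSalaryByDept(int[][] salaries, int[] deptEmpNo, String[] deptNo) {
        // Inner query: max( salary ) ... group by emp_no
        Map<Integer, Integer> maxSalary = new HashMap<>();
        for (int[] row : salaries) {
            maxSalary.merge(row[0], row[1], Math::max);
        }

        // Outer query: join on emp_no, then accumulate sum and count per dept_no.
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (int i = 0; i < deptEmpNo.length; i++) {
            Integer max = maxSalary.get(deptEmpNo[i]);
            if (max == null) {
                continue; // the join drops employees with no salary rows
            }
            double[] acc = sumAndCount.computeIfAbsent(deptNo[i], key -> new double[2]);
            acc[0] += max;
            acc[1] += 1;
        }

        // avg( max_salary ) ... group by dept_no
        Map<String, Double> result = new TreeMap<>();
        sumAndCount.forEach((dept, acc) -> result.put(dept, acc[0] / acc[1]));
        return result;
    }

    public static void main(String[] args) {
        int[][] salaries = { {1, 100}, {1, 120}, {2, 80}, {3, 90}, {3, 60} }; // made-up rows
        int[] deptEmpNo = { 1, 2, 3 };
        String[] deptNo = { "d001", "d001", "d002" };
        System.out.println(avgMaxSalaryByDept(salaries, deptEmpNo, deptNo));
        // prints {d001=100.0, d002=90.0}
    }
}
```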
ACCESS HADOOP FROM R
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is loaded into the cluster dynamically, on the fly
DATA PROVIDER API
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated in a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
http://cascading.io/driven/