DATA INTEGRATION AND
SQL APPLICATION MIGRATION
WITH
CASCADING LINGUAL
Chris K Wensel | Hadoop Summit EU 2014
• Not a “data scientist”
• No idea what “big data” means
• Used MR in anger once, and did it wrong
• Author of Cascading
• Co-Author of Lingual (w/ Julian Hyde)
CHRIS K WENSEL
2
3
Why is Hadoop & “big data” a thing?
More is better
HADOOP & BIG DATA
4
More Data
More Machines
More Algorithms
More Tools
HADOOP & BIG DATA
5
Worse is better
HADOOP & BIG DATA
6
Less red tape
More degrees of freedom
No upfront design
HADOOP & BIG DATA
7
8
Why Cascading?
Makes hard things possible.
CASCADING
9
While helping to retain
Conceptual Integrity.
CASCADING
10
"the speed of innovation is
proportional to the arrival rate of
answers to questions"
HADOOP & BIG DATA
11
True when you are questioning
Data, Algorithms, and
Architecture
CASCADING
12
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
13
Process Planner
Processing API Integration API
Scheduler API
Scheduler
Compute
Cascading
Data Stores
Scripting
Scala, Clojure, JRuby, Jython, Groovy
Enterprise Java
ECOSYSTEM
14
Lingual Pattern
Cascading
Hadoop MR
Scalding Cascalog
Hadoop Tez Whatever
• Started in 2007
• 2.0 released June 2012
• 2.5 stable out now
• 3.0 wip now available
• Tez support coming soon
• Apache Licensed Open-Source
• Supports all Hadoop 1 & 2 distros
CASCADING
15
ANSI SQL
on Cascading
on Whatever
LINGUAL
16
How’s this different than all the
other “SQL for Hadoop” projects?
LINGUAL
17
Not intended as
an ad-hoc query interface.
[Lingual is only as fast as Hadoop]
WHY LINGUAL?
18
Is intended to be
as standards compliant as
possible.
WHY LINGUAL?
19
Migrate workloads from expensive systems
to less expensive Hadoop
WHY LINGUAL?
20
Liberate the data trapped on Hadoop w/o
involving an Engineer
WHY LINGUAL?
21
• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
22
Query Planner
JDBC API Lingual API Provider API
Cascading
Compute
Lingual
Data Stores
CLI / Shell Enterprise Java
Catalog
• SQL-92
• Character, Numeric, and Temporal types
• IN and CASE
• FROM sub-queries
• CAST and CONVERT
• CURRENT_*
ANSI SQL
23
http://docs.cascading.org/lingual/1.1/#sql-support
24
query:
      {
          select
      |   query UNION [ ALL ] query
      |   query EXCEPT query
      |   query INTERSECT query
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]
      [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ]

orderItem:
      expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
      SELECT [ ALL | DISTINCT ]
          { * | projectItem [, projectItem ]* }
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { () | expression [, expression ]* } ]
      [ HAVING booleanExpression ]
      [ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]

projectItem:
          expression [ [ AS ] columnAlias ]
      |   tableAlias . *

tableExpression:
          tableReference [, tableReference ]*
      |   tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ]
              JOIN tableExpression [ joinCondition ]

joinCondition:
          ON booleanExpression
      |   USING ( column [, column ]* )

tableReference:
      tablePrimary [ [ AS ] alias [ ( columnAlias [, columnAlias ]* ) ] ]

tablePrimary:
          [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
      |   ( query )
      |   VALUES expression [, expression ]*
      |   ( TABLE expression )

windowRef:
          windowName
      |   windowSpec

windowSpec:
      [ windowName ]
      (
          [ ORDER BY orderItem [, orderItem ]* ]
          [ PARTITION BY expression [, expression ]* ]
          {
              RANGE numericOrInterval { PRECEDING | FOLLOWING }
          |
              ROWS numeric { PRECEDING | FOLLOWING }
          }
      )
Lingual 1.1 -> Optiq 0.4.12.3
https://github.com/julianhyde/optiq/blob/master/REFERENCE.md
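To make the ORDER BY, OFFSET, and FETCH clauses of the grammar above a little more concrete, here is a minimal in-memory sketch of what they ask for: sort descending with NULLS LAST, skip `offset` rows, then keep at most `count` rows. It uses plain Java streams, not Lingual or Optiq, and the class name, method name, and sample values are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: mimics ORDER BY ... DESC NULLS LAST, OFFSET, and FETCH
// on an in-memory list. This is not how Lingual plans or runs a query.
class OrderOffsetFetchSketch {

    static List<Integer> page(List<Integer> values, long offset, long count) {
        // DESC ordering, with nulls sorted after every non-null value (NULLS LAST)
        Comparator<Integer> descNullsLast =
            Comparator.nullsLast(Comparator.<Integer>reverseOrder());

        return values.stream()
            .sorted(descNullsLast)  // ORDER BY value DESC NULLS LAST
            .skip(offset)           // OFFSET offset ROWS
            .limit(count)           // FETCH NEXT count ROWS
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, null, 1, 5, 2);
        // Sorted: 5, 3, 2, 1, null. Skip one row, keep two.
        System.out.println(page(values, 1, 2)); // prints [3, 2]
    }
}
```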
Lingual provides two interfaces.
APIS
25
Allows SQL and non-SQL Flows to work
together as a single application via
conceptually similar interfaces
CASCADING API
26
27
Cascading API

  // emplTap, salesTap, resultsTap, and sqlStatement are defined elsewhere
  FlowDef flowDef = FlowDef.flowDef()
    .setName( "sqlflow" )
    .addSource( "example.employee", emplTap )
    .addSource( "example.sales", salesTap )
    .addSink( "results", resultsTap );

  // the SQL statement is planned into the flow as an assembly
  SQLPlanner sqlPlanner = new SQLPlanner()
    .setSql( sqlStatement );

  flowDef.addAssemblyPlanner( sqlPlanner );

  Flow flow = new HadoopFlowConnector().connect( flowDef );

  flow.complete();
So systems and people can talk directly to
Hadoop-visible data
JDBC API
28
29
JDBC driver

  public void run() throws ClassNotFoundException, SQLException {
    Class.forName( "cascading.lingual.jdbc.Driver" );
    Connection connection =
      DriverManager.getConnection(
        "jdbc:lingual:local;schemas=src/main/resources/data/example" );
    Statement statement = connection.createStatement();

    ResultSet resultSet = statement.executeQuery(
      "select *\n"
        + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
        + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
        + "on e.\"EMPID\" = s.\"CUST_ID\"" );

    // do something

    resultSet.close();
    statement.close();
    connection.close();
  }
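The connection strings on these slides follow the shape `jdbc:lingual:<platform>;key=value;...` (for example `local` followed by `schemas=...`). As a rough sketch of how such a string breaks down, the helper below splits it into a platform and its properties. The class name and the `"platform"` map key are hypothetical, and this is not the driver's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: not part of Lingual. Splits a URL shaped like the one
// on the slide into its platform and its semicolon-separated properties.
class LingualUrlSketch {
    static final String PREFIX = "jdbc:lingual:";

    // Returns the platform under the made-up key "platform", followed by
    // each key=value property in the order it appears in the URL.
    static Map<String, String> parse(String url) {
        if (!url.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a jdbc:lingual URL: " + url);
        }
        String[] parts = url.substring(PREFIX.length()).split(";");
        Map<String, String> result = new LinkedHashMap<>();
        result.put("platform", parts[0]);
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                result.put(parts[i], ""); // property given without a value
            } else {
                result.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            parse("jdbc:lingual:local;schemas=src/main/resources/data/example"));
        // prints {platform=local, schemas=src/main/resources/data/example}
    }
}
```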
JDBC
30
[Diagram labels: Server / Desktop; JDBC; Flow; Assembly; Cluster; Job, Job; SQL: select * from employees ...; lingual-hadoop-1.1.0-jdbc.jar; meta-data catalog]
DEFAULT SHELL
31
select dept_no, avg( max_salary ) from employees.dept_emp,
( select emp_no as sal_emp_no, max( salary ) as max_salary from employees.salaries
group by emp_no )
where dept_emp.emp_no = sal_emp_no group by dept_no;
SUB-QUERY
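For readers less used to FROM sub-queries, the query above can be restated over in-memory collections: the inner query keeps each employee's maximum salary, and the outer query joins that to dept_emp and averages per department. This is only a sketch of the query's meaning with invented rows and hypothetical class names; it says nothing about how Lingual plans the query on Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: an in-memory restatement of the slide's query.
class SubQuerySketch {
    static final class Salary {
        private final int empNo;
        private final int salary;
        Salary(int empNo, int salary) { this.empNo = empNo; this.salary = salary; }
        int empNo() { return empNo; }
        int salary() { return salary; }
    }

    static final class DeptEmp {
        private final int empNo;
        private final String deptNo;
        DeptEmp(int empNo, String deptNo) { this.empNo = empNo; this.deptNo = deptNo; }
        int empNo() { return empNo; }
        String deptNo() { return deptNo; }
    }

    static Map<String, Double> avgMaxSalaryByDept(List<DeptEmp> deptEmp, List<Salary> salaries) {
        // Inner query: select emp_no, max( salary ) ... group by emp_no
        Map<Integer, Integer> maxSalary = salaries.stream()
            .collect(Collectors.toMap(Salary::empNo, Salary::salary, Integer::max));

        // Outer query: join on emp_no, then avg( max_salary ) group by dept_no
        return deptEmp.stream()
            .filter(d -> maxSalary.containsKey(d.empNo()))
            .collect(Collectors.groupingBy(
                DeptEmp::deptNo,
                TreeMap::new,
                Collectors.averagingInt(d -> maxSalary.get(d.empNo()))));
    }

    public static void main(String[] args) {
        List<Salary> salaries = Arrays.asList(
            new Salary(1, 100), new Salary(1, 120),
            new Salary(2, 200),
            new Salary(3, 90), new Salary(3, 80));
        List<DeptEmp> deptEmp = Arrays.asList(
            new DeptEmp(1, "d001"), new DeptEmp(2, "d001"), new DeptEmp(3, "d002"));

        System.out.println(avgMaxSalaryByDept(deptEmp, salaries));
        // prints {d001=160.0, d002=90.0}
    }
}
```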
32
ACCESS HADOOP FROM R
33
  # load the JDBC package
  library(RJDBC)

  # set up the driver
  drv <- JDBC("cascading.lingual.jdbc.Driver",
    "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

  # set up a database connection to a local repository
  connection <- dbConnect(drv,
    "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

  # query the repository: in this case the MySQL sample database (CSV files)
  df <- dbGetQuery(connection,
    "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
  head(df)

  # use R functions to summarize and visualize part of the data
  df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
  summary(df$hire_age)

  library(ggplot2)
  m <- ggplot(df, aes(x=hire_age))
  m <- m + ggtitle("Age at hire, people named Gina")
  m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
RESULTS
34
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
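The hire_age column summarized above is the whole number of days between BIRTH_DATE and HIRE_DATE, divided by 365.25. For a Java client reading the same rows over JDBC, that arithmetic might look like the sketch below. The class name, method name, and sample dates are invented for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Illustration only: the same days / 365.25 arithmetic as the R snippet.
class HireAgeSketch {

    static double hireAgeYears(LocalDate birthDate, LocalDate hireDate) {
        // whole days between the two dates, like as.integer(hire - birth) in R
        long days = ChronoUnit.DAYS.between(birthDate, hireDate);
        return days / 365.25;
    }

    public static void main(String[] args) {
        // made-up dates, not taken from the sample database
        LocalDate birth = LocalDate.parse("1960-03-01");
        LocalDate hire = LocalDate.parse("1990-03-01");
        System.out.println(hireAgeYears(birth, hire));
    }
}
```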
INTEGRATION
35
But I use a custom data format!
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The Jar is dynamically loaded into the cluster, on the fly
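The provider spec in the list above, `cascading-jdbc:cascading-jdbc-oracle-provider:1.0`, reads as a `group:name:version` coordinate. As a small sketch of what that string carries, the hypothetical helper below splits it and derives where such a jar would sit under a repository root in the default Maven layout. It assumes no classifier on the jar name, which may not hold for the real provider artifact, and it is not Lingual's own resolution code.

```java
// Illustration only: a made-up helper, not part of Lingual or Maven.
class ProviderSpecSketch {
    final String group;
    final String name;
    final String version;

    ProviderSpecSketch(String group, String name, String version) {
        this.group = group;
        this.name = name;
        this.version = version;
    }

    // Splits "group:name:version" and rejects anything with missing parts.
    static ProviderSpecSketch parse(String spec) {
        String[] parts = spec.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected group:name:version, got: " + spec);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty element in: " + spec);
            }
        }
        return new ProviderSpecSketch(parts[0], parts[1], parts[2]);
    }

    // Jar location relative to a repository root, default Maven layout,
    // assuming the jar name carries no classifier.
    String repositoryPath() {
        return group.replace('.', '/') + "/" + name + "/" + version
            + "/" + name + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        ProviderSpecSketch spec =
            parse("cascading-jdbc:cascading-jdbc-oracle-provider:1.0");
        System.out.println(spec.repositoryPath());
        // prints cascading-jdbc/cascading-jdbc-oracle-provider/1.0/cascading-jdbc-oracle-provider-1.0.jar
    }
}
```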
DATA PROVIDER API
36
DATA PROVIDER
37
JDBC
Maven Repo
Assembly Flow
Cluster
Job Job
lingual-hadoop-1.1.0-jdbc.jar
cascading-jdbc-oracle-provider.jar
your-avro-provider.jar
AMAZON EMR & REDSHIFT
38
Amazon Elastic MapReduce
Job Job Job Job
SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...
Amazon S3
Amazon RedShift
file1 file2
results
http://docs.cascading.org/tutorials/lingual-redshift/
All Cascading applications can be
visualized and monitored …
MANAGED
39
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated on a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
40
http://cascading.io/driven/
MANAGED WITH DRIVEN
41
42
A BOOK!
43
Enterprise Data Workflows with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
chris@concurrentinc.com
@cwensel
DONE
44
