SQL and high performance computing on Hadoop
Graham Mossman, Senior Solution Engineer, EXASOL
© 2014 EXASOL AG

I Love My Lawnmower ...
... because it cuts my grass well

But ...
... it's quite a struggle cutting my hedge

And ...
... it isn't good at making apple sauce

And don't even think about ...
... using it to cut hair

Hadoop today is …
§ Still Open Source!
§ Began with HDFS and Map/Reduce
§ Now comprises a number of additional technologies
  § File systems (e.g. Tachyon)
  § Cluster Managers (e.g. YARN + Mesos)
  § Execution Engines (e.g. Tez, Spark etc.)
  § Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)

Hadoop With Everything
§ Hadoop was invented to more easily distribute the Nutch and Lucene applications across a cluster of machines.
  § Map/Reduce – distributed processing
  § HDFS – distributed file system
§ Began to be used for … just about everything.
  § But not all processing tasks are like indexing the Internet
§ Hadoop started to attract criticism
  § But usually when it was being used for something it wasn't designed for

Definitely NOT jobs for Hadoop
§ Word processing
§ Payroll system
§ Anything on a single computer
§ Anything with "small" data

Analytical Queries
§ "GROUP BY" logic
  § i.e. not concerned with individual data items
§ Analytical Functions
  § MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
§ Table joins, nested sub-queries
Usually short-running, ad-hoc and submitted many at a time.

Map/Reduce and HDFS: the wrong tools for Analytics?
§ Queries tend to be short: fault tolerance is less important
  § If chance of failure in a 5 hour batch is 1 in 300
  § Chance of failure in a 5 second query is 1 in 1,000,000
§ Queries tend to be short: start-up time is significant
  § a 20 second start-up time is NOT OK on a 5 second query
§ A number of projects started to address these issues
  § e.g. "Hot containers" in Hive on Tez to reduce start-up time
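The failure figures on this slide can be checked with a little arithmetic. The sketch below assumes failures arrive at a constant rate over time (an exponential model, which the slide does not state), and the function name `failure_probability` is purely illustrative. It also works out how much of the elapsed time a 20 second start-up takes on a 5 second query.

```python
import math


def failure_probability(p_ref: float, t_ref_s: float, t_s: float) -> float:
    """Scale a failure probability from a job lasting t_ref_s seconds
    to a job lasting t_s seconds.

    Assumption (not from the slides): failures occur at a constant rate,
    so survival over time t is exp(-rate * t).
    """
    rate = -math.log1p(-p_ref) / t_ref_s  # failures per second
    return -math.expm1(-rate * t_s)       # 1 - exp(-rate * t_s)


# Slide figures: a 5 hour batch fails 1 time in 300.
batch_p = 1 / 300
query_p = failure_probability(batch_p, t_ref_s=5 * 3600, t_s=5)
print(f"5 second query fails about 1 time in {1 / query_p:,.0f}")

# Slide figures: 20 seconds of start-up on a 5 second query.
startup_s, query_s = 20, 5
startup_share = startup_s / (startup_s + query_s)
print(f"start-up share of elapsed time: {startup_share:.0%}")
```

Under that assumption the query fails a little more rarely than one time in a million, which matches the rounded figure on the slide, and start-up accounts for four fifths of the total elapsed time.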

Map/Reduce: the wrong language for Analytics?
Example taken from Reynold Xin's 2012 "Shark: Hive (SQL) on Spark" presentation
Stage 0: Map-Shuffle-Reduce
Mapper(row) {
  fields = row.split("\t")
  emit(fields[0], fields[1]);
}
Reducer(key, values) {
  sum = 0;
  for (value in values) {
    sum += value;
  }
  emit(key, sum);
}
Stage 1: Map-Shuffle
Mapper(row) {
  ...
  emit(page_views, page_name);
}
... shuffle
Stage 2: Local
data = open("stage1.out")
for (i in 0 to 10) {
  print(data.getNext())
}
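To make the three stages concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the sample rows, the `shuffle` helper and the `run` function are all hypothetical, and a descending sort stands in for the Stage 1 shuffle so that the local Stage 2 read sees the most viewed pages first.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical sample input: one "page_name<TAB>page_views" record per row.
rows = ["Main_Page\t120", "Hadoop\t30", "Main_Page\t80", "SQL\t45", "Hadoop\t25"]


def stage0_map(row):
    """Stage 0 mapper: split the row and emit (page_name, page_views)."""
    fields = row.split("\t")
    yield fields[0], int(fields[1])


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def stage0_reduce(key, values):
    """Stage 0 reducer: sum the views for one page."""
    yield key, sum(values)


def stage1_map(record):
    """Stage 1 mapper: re-key each record by its view count."""
    page_name, page_views = record
    yield page_views, page_name


def run(rows, limit=10):
    mapped = (pair for row in rows for pair in stage0_map(row))
    reduced = [out for key, values in shuffle(mapped).items()
               for out in stage0_reduce(key, values)]
    # Assumption: the Stage 1 shuffle delivers keys in descending order.
    ordered = sorted((pair for rec in reduced for pair in stage1_map(rec)),
                     reverse=True)
    # Stage 2: read the first `limit` records locally.
    return [(name, views) for views, name in islice(ordered, limit)]


top_pages = run(rows)
print(top_pages)
```

Even in this toy form the job needs several functions and explicit plumbing between stages, which is the point the slide is making.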

Equivalent in SQL
SELECT page_name, SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
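For anyone who wants to try the query, the sketch below runs it unchanged against an in-memory SQLite database from Python. SQLite is only a convenient stand-in here, not one of the engines discussed in the talk, and the table contents are made-up sample data.

```python
import sqlite3

# In-memory database with a small, made-up wikistats table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO wikistats VALUES (?, ?)",
    [("Main_Page", 120), ("Hadoop", 30), ("Main_Page", 80),
     ("SQL", 45), ("Hadoop", 25)],
)

# The query from the slide, verbatim.
query = """
SELECT page_name, SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
"""

result = conn.execute(query).fetchall()
for page_name, views in result:
    print(page_name, views)
```

The grouping, ordering and limiting are all left to the database engine; the caller only states what result is wanted.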

The SQL language
§ Portable
  § Well-defined standards exist
§ No detailed knowledge of the platform required
  § e.g. you don't need to manage memory
§ SQL is assumed by a lot of reporting tools
§ Widely used and understood even by non-technical people

I'm not saying that SQL is perfect
• Try writing the simple Hadoop "Word Count" example in pure SQL
• Or try to "sessionise" weblog data
• Or anything with data that is not structured
  • "Which part of STRUCTURED Query Language don't you understand …?!"
• All I'm saying is that it is an excellent language for analytical queries.
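As an illustration of why sessionising is awkward to express declaratively, here is a minimal procedural sketch in Python. The 30 minute inactivity timeout, the `sessionise` function and the sample hits are all assumptions made for this example; the slides do not define them.

```python
from datetime import datetime, timedelta

# Assumed rule: a gap of more than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionise(hits):
    """Assign a per-user session number to each (user_id, timestamp) hit.

    Returns a list of (user_id, timestamp, session_no), ordered by user
    and then by time.
    """
    out = []
    last_seen = {}   # user_id -> timestamp of that user's previous hit
    session_no = {}  # user_id -> current session number
    for user, ts in sorted(hits):
        prev = last_seen.get(user)
        if prev is None or ts - prev > SESSION_GAP:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


# Made-up weblog hits.
hits = [
    ("alice", datetime(2014, 1, 1, 9, 0)),
    ("alice", datetime(2014, 1, 1, 9, 10)),
    ("bob", datetime(2014, 1, 1, 9, 5)),
    ("alice", datetime(2014, 1, 1, 10, 30)),
]
sessions = sessionise(hits)
for user, ts, number in sessions:
    print(user, ts.isoformat(), number)
```

Each row's session depends on the previous row for the same user, and that running state is what a plain GROUP BY cannot carry.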

Hadoop could handle SQL (via Hive), but historically …
§ High Latency
§ Restricted SQL options
  § All but simple table joins were difficult
§ Little support for compression & indexing
§ Merv Adrian (Gartner Research - 2014)
  § "What is remarkable is that Hadoop does SQL. Just don't expect it to do it well"
§ Result: EVERYTHING looked good compared to Hive

Everyone still likes to compare themselves to Hive

EXASOL being no exception!

Hive continues to be improved …
§ Completed
  § Views (HIVE-1143)
  § Partitioned Views (HIVE-1941)
  § Storage Handlers (HIVE-705)
  § HBase Integration
  § HBase Bulk Load
  § Locking (HIVE-1293)
  § Indexes (HIVE-417)
  § Bitmap Indexes (HIVE-1803)
  § Filter Pushdown (HIVE-279)
  § Table-level Statistics (HIVE-1361)
  § Dynamic Partitions
  § Binary Data Type (HIVE-2380)
  § Decimal Precision and Scale Support
  § HCatalog
  § HiveServer2 (HIVE-2935)
  § Column Statistics in Hive (HIVE-1362)
  § List Bucketing (HIVE-3026)
  § Group By With Rollup (HIVE-2397)
  § Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  § Optimizing Skewed Joins (HIVE-3086)
  § Correlation Optimizer (HIVE-2206)
  § Hive on Tez (HIVE-4660)
  § Vectorized Query Execution (HIVE-4160)
§ In Progress
  § Atomic Insert/Update/Delete (HIVE-5317)
  § Transaction Manager (HIVE-5843)
  § Cost Based Optimizer in Hive (HIVE-5775)
§ Proposed
  § Spatial Queries
  § Theta Join (HIVE-556)
  § JDBC Storage Handler
  § MapJoin Optimization
  § Proposal to standardize and expand Authorization in Hive
  § Dependent Tables (HIVE-3466)
  § AccessServer
  § Type Qualifiers in Hive
  § MapJoin & Partition Pruning (HIVE-5119)
  § SQL Standard based secure authorization (HIVE-5837)
  § Updatable Views (HIVE-1143)
  § Hive on Spark (HIVE-7292)

The dream data architecture for analytics …
§ Based on the SQL language
§ but leverages Hadoop's extreme scalability
§ and Hadoop's fault tolerance
§ while not compromising on speed.
Could it please also have some maturity? And be easy to use?

The current reality
§ SQL on SQL, which is arguably
  § Less scalable
  § Less fault tolerant
  § Less good with unstructured data
§ SQL on Hadoop, which is arguably
  § Less mature
  § Less easy to use
  § Slower

Choices for SQL and Hadoop
§ SQL AND HADOOP
  § A Connector
§ HADOOP ON SQL
  § User Defined Functions
§ SQL ON HADOOP
  § Something like Hive, but better

Option 1 – SQL AND HADOOP
Run SQL-on-SQL and Hadoop-on-Hadoop and use a connector to join the two systems
Pros
§ Minimal impact (SQL and Hadoop worlds can function as before)
§ Easier to implement
Cons
§ Network!
§ Challenge of optimising across two technologies

Option 2 – HADOOP ON SQL
§ Bring Map/Reduce into the Parallel database
§ For example using Java User Defined Functions
select my_java_map_function(words) a_word, count(*) word_count
from DOCUMENTS
group by 1
§ Doesn't benefit from Hadoop's storage advantages
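The query on this slide combines a map-style function that emits several rows per input row with an ordinary GROUP BY. The Python sketch below shows what that combination computes; it is not EXASOL's UDF interface. `my_map_function` is a hypothetical stand-in for the Java function named on the slide, the documents are made up, and lower-casing the words is an extra assumption.

```python
from collections import Counter


def my_map_function(words: str):
    """Hypothetical stand-in for the slide's Java map UDF:
    emit one output row per word in the input column."""
    for word in words.lower().split():
        yield word


# Made-up contents of the DOCUMENTS table, one string per row.
documents = ["SQL on Hadoop", "Hadoop on SQL", "SQL and Hadoop"]

# Equivalent of: select <map udf>(words) a_word, count(*) word_count
#                from DOCUMENTS group by 1
word_count = Counter(
    word for doc in documents for word in my_map_function(doc)
)

for a_word, count in sorted(word_count.items()):
    print(a_word, count)
```

The map step is user code while the grouping and counting stay with the database, which is how this option brings the Map/Reduce pattern into a parallel SQL engine.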

Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
§ Impala (Cloudera)
§ Stinger (Hortonworks)
§ Presto (Facebook)
§ SparkSQL (UC Berkeley)
§ HAWQ (Pivotal)
§ BigSQL (IBM)
§ Apache Phoenix (for HBase)
§ Apache Tajo
§ Apache Drill
§ etc etc etc …
AND DON'T FORGET HIVE!

Four possible market outcomes…
§ Hadoop and SQL databases are on a collision course – only one will survive
  § No sign of that so far
§ They are complementary – both will survive
  § Probably - the challenge is how to make them work together
§ They will merge and become one
  § Some indications this is already starting to happen
§ Something even more amazing will come along and replace them both
  § Sometimes this happens – Spark?

My Personal Opinionated Opinion
Better to use a tool that has been made for the job
A purpose-built tool will always beat one made originally for another purpose.

Questions?
My contact details:
Email: graham.mossman@exasol.com
Twitter: @EXADude

More Related Content

What's hot

Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationDataStax Academy
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopOctober 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopYahoo Developer Network
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsYifeng Jiang
 
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...Hadoop / Spark Conference Japan
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Brian Bulkowski. Aerospike
Brian Bulkowski. AerospikeBrian Bulkowski. Aerospike
Brian Bulkowski. AerospikeVolha Banadyseva
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerHBaseCon
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance UpdateCloudera, Inc.
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 

What's hot (20)

Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in HadoopOctober 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
 
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
The Evolution and Future of Hadoop Storage (Hadoop Conference Japan 2016キーノート...
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Brian Bulkowski. Aerospike
Brian Bulkowski. AerospikeBrian Bulkowski. Aerospike
Brian Bulkowski. Aerospike
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at Cerner
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 

Similar to Graham Mossman - SQL and high performance computing on Hadoop

SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL
SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOLSQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL
SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOLBCS Data Management Specialist Group
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloudSteve Loughran
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
NoSQL in Real-time Architectures
NoSQL in Real-time ArchitecturesNoSQL in Real-time Architectures
NoSQL in Real-time ArchitecturesRonen Botzer
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsGuy Harrison
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Jean-Pierre König
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Caching and JCache with Greg Luck 18.02.16
Caching and JCache with Greg Luck 18.02.16Caching and JCache with Greg Luck 18.02.16
Caching and JCache with Greg Luck 18.02.16Comsysto Reply GmbH
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architectureAdrian Tanase
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
What's Next for Google's BigTable
What's Next for Google's BigTableWhat's Next for Google's BigTable
What's Next for Google's BigTableSqrrl
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...StampedeCon
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNHortonworks
 

Similar to Graham Mossman - SQL and high performance computing on Hadoop (20)

SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL
SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOLSQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL
SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloud
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
NoSQL in Real-time Architectures
NoSQL in Real-time ArchitecturesNoSQL in Real-time Architectures
NoSQL in Real-time Architectures
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Caching and JCache with Greg Luck 18.02.16
Caching and JCache with Greg Luck 18.02.16Caching and JCache with Greg Luck 18.02.16
Caching and JCache with Greg Luck 18.02.16
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architecture
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
What's Next for Google's BigTable
What's Next for Google's BigTableWhat's Next for Google's BigTable
What's Next for Google's BigTable
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon ...
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARN
 

More from huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 

More from huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Graham Mossman - SQL and high performance computing on Hadoop

  • 1. SQL and high performance computing on Hadoop. Graham Mossman, Senior Solution Engineer, EXASOL © 2014 EXASOL AG
  • 2. I Love My Lawnmower ... because it cuts my grass well
  • 3. But ... it’s quite a struggle cutting my hedge
  • 4. And ... it isn’t good at making apple sauce
  • 5. And don’t even think about ... using it to cut hair
  • 6. Hadoop today is …
      § Still Open Source!
      § Began with HDFS and Map/Reduce
      § Now comprises a number of additional technologies
          § File systems (e.g. Tachyon)
          § Cluster Managers (e.g. YARN + Mesos)
          § Execution Engines (e.g. Tez, Spark etc.)
          § Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)
  • 7. Hadoop With Everything
      § Hadoop was invented to more easily distribute the Nutch and Lucene applications across a cluster of machines.
          § Map/Reduce – distributed processing
          § HDFS – distributed file system
      § Began to be used for … just about everything.
          § But not all processing tasks are like indexing the Internet
      § Hadoop started to attract criticism
          § But usually when it was being used for something it wasn’t designed for
  • 8. Definitely NOT jobs for Hadoop
      § Word processing
      § Payroll system
      § Anything on a single computer
      § Anything with “small” data
  • 9. Analytical Queries
      § “GROUP BY” logic
          § i.e. not concerned with individual data items
      § Analytical Functions
          § MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
      § Table joins, nested sub-queries
      Usually short-running, ad-hoc and submitted many at a time.
  • 10. Map/Reduce and HDFS: the wrong tools for Analytics?
      § Queries tend to be short: fault tolerance is less important
          § If the chance of failure in a 5 hour batch is 1 in 300
          § then the chance of failure in a 5 second query is about 1 in 1,000,000
      § Queries tend to be short: start-up time is significant
          § A 20 second start-up time is NOT OK on a 5 second query
      § A number of projects started to address these issues
          § e.g. “Hot containers” in Hive on Tez to reduce start-up time
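The arithmetic behind slide 10 can be checked with a short sketch. It assumes failures arrive at a constant rate, so that for small probabilities the chance of failure scales with job length; the function names `failure_odds` and `startup_share` are illustrative and not from the talk.

```python
# Assumption: failures arrive at a constant rate, so the chance of a failure
# scales (for small probabilities) with how long the job runs.
BATCH_SECONDS = 5 * 60 * 60      # the 5 hour batch from the slide
QUERY_SECONDS = 5                # the 5 second query from the slide
BATCH_FAILURE_ODDS = 300         # "1 in 300" for the batch


def failure_odds(job_seconds: float) -> float:
    """Return N such that the chance of failure is roughly 1 in N."""
    return BATCH_FAILURE_ODDS * BATCH_SECONDS / job_seconds


def startup_share(startup_seconds: float, work_seconds: float) -> float:
    """Fraction of elapsed time spent on start-up rather than on the query."""
    return startup_seconds / (startup_seconds + work_seconds)


query_odds = failure_odds(QUERY_SECONDS)
print(f"5 second query: about 1 in {query_odds:,.0f}")
print(f"20 s start-up on a 5 s query: {startup_share(20, 5):.0%} of elapsed time")
```

Under that assumption the query is 3,600 times shorter than the batch, which gives odds of about 1 in 1,080,000 (the slide rounds this to 1 in 1,000,000), and a 20 second start-up accounts for 80% of the elapsed time of a 5 second query.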
  • 11. Map/Reduce: the wrong language for Analytics?
      Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

      Stage 0: Map-Shuffle-Reduce
          Mapper(row) {
              fields = row.split("\t")
              emit(fields[0], fields[1]);
          }
          Reducer(key, values) {
              sum = 0;
              for (value in values) {
                  sum += value;
              }
              emit(key, sum);
          }

      Stage 1: Map-Shuffle
          Mapper(row) {
              ...
              emit(page_views, page_name);
          }
          ... shuffle

      Stage 2: Local
          data = open("stage1.out")
          for (i in 0 to 10) {
              print(data.getNext())
          }
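To make the three stages concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the `shuffle` helper and the sample `ROWS` are assumptions made for illustration, and a descending sort stands in for the sorted shuffle that a real cluster would perform between stages.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical sample rows: "page_name<TAB>page_views", mirroring the slide.
ROWS = ["Hadoop\t120", "SQL\t300", "Hadoop\t80", "Spark\t150", "SQL\t50"]


def shuffle(pairs):
    """Group emitted (key, value) pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


# Stage 0: map each row to (page_name, page_views), then sum per key.
def stage0_map(row):
    fields = row.split("\t")
    yield fields[0], int(fields[1])


def stage0_reduce(key, values):
    yield key, sum(values)


# Stage 1: re-key on page_views so that sorting by key orders pages by views.
def stage1_map(page_name, page_views):
    yield page_views, page_name


def top_pages(rows, limit=10):
    mapped = (pair for row in rows for pair in stage0_map(row))
    totals = [out for key, values in shuffle(mapped).items()
              for out in stage0_reduce(key, values)]
    rekeyed = (pair for name, views in totals for pair in stage1_map(name, views))
    ordered = sorted(rekeyed, reverse=True)   # stands in for the sorted shuffle
    # Stage 2: read the first `limit` records locally.
    return [(name, views) for views, name in islice(ordered, limit)]


print(top_pages(ROWS))
```

Even in this toy form, the logic is spread across two map functions, a reduce function and a driver, which is the point the slide makes about Map/Reduce as a language for analytics.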
  • 12. Equivalent in SQL
          SELECT page_name, SUM(page_views) views
          FROM wikistats
          GROUP BY page_name
          ORDER BY views DESC
          LIMIT 10;
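The query on slide 12 can be tried without a cluster. The sketch below runs it against an in-memory SQLite database through Python's standard `sqlite3` module; the two-column `wikistats` schema and the sample rows are assumptions, since the slide does not show the table definition.

```python
import sqlite3

# Assumption: a two-column wikistats table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO wikistats VALUES (?, ?)",
    [("Hadoop", 120), ("SQL", 300), ("Hadoop", 80), ("Spark", 150), ("SQL", 50)],
)

TOP_PAGES_SQL = """
SELECT page_name, SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
"""

result = conn.execute(TOP_PAGES_SQL).fetchall()
print(result)
```

The database decides how to group, sort and limit; the query states only what is wanted.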
  • 13. The SQL language
      § Portable
      § Well-defined standards exist
      § No detailed knowledge of the platform required
          § e.g. you don’t need to manage memory
      § SQL is assumed by a lot of reporting tools
      § Widely used and understood even by non-technical people
  • 14. I’m not saying that SQL is perfect
      • Try writing the simple Hadoop “Word Count” example in pure SQL
      • Or try to “sessionise” weblog data
      • Or anything with data that is not structured
      • “Which part of STRUCTURED Query Language don’t you understand …?!”
      • All I’m saying is that it is an excellent language for analytical queries.
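Sessionising illustrates the kind of logic that does not map onto a plain GROUP BY, because each row's session depends on the row before it. A minimal procedural sketch follows; the 30 minute inactivity timeout is a common convention rather than something the slide specifies, and the click data is made up.

```python
from datetime import datetime, timedelta

# Assumption: a session ends after 30 minutes of inactivity.
SESSION_GAP = timedelta(minutes=30)

CLICKS = [
    ("alice", datetime(2014, 11, 1, 9, 0)),
    ("alice", datetime(2014, 11, 1, 9, 10)),
    ("bob",   datetime(2014, 11, 1, 9, 5)),
    ("alice", datetime(2014, 11, 1, 10, 30)),   # more than 30 min later: new session
]


def sessionise(clicks, gap=SESSION_GAP):
    """Return (user, timestamp, session_number) with per-user session numbering."""
    last_seen = {}    # user -> timestamp of that user's previous click
    session_no = {}   # user -> current session number, starting at 1
    out = []
    for user, ts in sorted(clicks, key=lambda c: (c[0], c[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


for row in sessionise(CLICKS):
    print(row)
```

The loop carries state from one click to the next, which is natural in a procedural language and awkward to express as a simple aggregate.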
  • 15. Hadoop could handle SQL (via Hive), but historically …
      § High Latency
      § Restricted SQL options
      § All but simple table joins were difficult
      § Little support for compression & indexing
      § Merv Adrian (Gartner Research - 2014)
          § “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
      § Result: EVERYTHING looked good compared to Hive
  • 16. Everyone still likes to compare themselves to Hive
  • 17. EXASOL being no exception!
  • 18. Hive continues to be improved …
      § Completed
          § Views (HIVE-1143)
          § Partitioned Views (HIVE-1941)
          § Storage Handlers (HIVE-705)
          § HBase Integration
          § HBase Bulk Load
          § Locking (HIVE-1293)
          § Indexes (HIVE-417)
          § Bitmap Indexes (HIVE-1803)
          § Filter Pushdown (HIVE-279)
          § Table-level Statistics (HIVE-1361)
          § Dynamic Partitions
          § Binary Data Type (HIVE-2380)
          § Decimal Precision and Scale Support
          § HCatalog
          § HiveServer2 (HIVE-2935)
          § Column Statistics in Hive (HIVE-1362)
          § List Bucketing (HIVE-3026)
          § Group By With Rollup (HIVE-2397)
          § Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
          § Optimizing Skewed Joins (HIVE-3086)
          § Correlation Optimizer (HIVE-2206)
          § Hive on Tez (HIVE-4660)
          § Vectorized Query Execution (HIVE-4160)
      § In Progress
          § Atomic Insert/Update/Delete (HIVE-5317)
          § Transaction Manager (HIVE-5843)
          § Cost Based Optimizer in Hive (HIVE-5775)
      § Proposed
          § Spatial Queries
          § Theta Join (HIVE-556)
          § JDBC Storage Handler
          § MapJoin Optimization
          § Proposal to standardize and expand Authorization in Hive
          § Dependent Tables (HIVE-3466)
          § AccessServer
          § Type Qualifiers in Hive
          § MapJoin & Partition Pruning (HIVE-5119)
          § SQL Standard based secure authorization (HIVE-5837)
          § Updatable Views (HIVE-1143)
          § Hive on Spark (HIVE-7292)
  • 19. The dream data architecture for analytics …
      § Based on the SQL language
      § but leverages Hadoop’s extreme scalability
      § and Hadoop’s fault tolerance
      § while not compromising on speed.
      Could it please also have some maturity? And be easy to use?
  • 20. The current reality
      § SQL on SQL, which is arguably
          § Less scalable
          § Less fault tolerant
          § Less good with unstructured data
      § SQL on Hadoop, which is arguably
          § Less mature
          § Less easy to use
          § Slower
  • 21. Choices for SQL and Hadoop
      § SQL AND HADOOP
          § A Connector
      § HADOOP ON SQL
          § User Defined Functions
      § SQL ON HADOOP
          § Something like Hive, but better
  • 22. Option 1 – SQL AND HADOOP
      Run SQL-on-SQL and Hadoop-on-Hadoop and use a connector to join the two systems
      Pros
          § Minimal impact (SQL and Hadoop worlds can function as before)
          § Easier to implement
      Cons
          § Network!
          § Challenge of optimising across two technologies
  • 23. Option 2 – HADOOP ON SQL
      § Bring Map/Reduce into the parallel database
      § For example using Java User Defined Functions
          select my_java_map_function(words) a_word, count(*) word_count
          from DOCUMENTS
          group by 1
      § Doesn’t benefit from Hadoop’s storage advantages
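The shape of the query on slide 23 can be tried on a small scale with SQLite, which lets Python functions be registered as SQL functions via `sqlite3.Connection.create_function`. This is only a stand-in for Java UDFs in a parallel database: `normalise_word`, `my_map_function` and the one-token-per-row `documents` table are assumptions made for illustration.

```python
import sqlite3


def normalise_word(word):
    """Hypothetical 'map' step: lower-case a token and strip punctuation."""
    return word.strip(".,;:!?\"'").lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it, much as a parallel
# database would register a Java UDF.
conn.create_function("my_map_function", 1, normalise_word)

# Assumption: documents holds one token per row; a scalar UDF returns one
# value per input row, so it cannot split a whole document into words.
conn.execute("CREATE TABLE documents (words TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("Hadoop",), ("hadoop,",), ("SQL",), ("sql.",), ("SQL!",)],
)

word_counts = conn.execute(
    """
    SELECT my_map_function(words) a_word, COUNT(*) word_count
    FROM documents
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()
print(word_counts)
```

The user-defined function plays the role of the mapper and the database's own GROUP BY plays the role of shuffle and reduce.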
  • 24. Option 3 - SQL ON HADOOP
      Build a relational database on Hadoop storage
      § Impala (Cloudera)
      § Stinger (Hortonworks)
      § Presto (Facebook)
      § SparkSQL (UC Berkeley)
      § HAWQ (Pivotal)
      § BigSQL (IBM)
      § Apache Phoenix (for HBase)
      § Apache Tajo
      § Apache Drill
      § etc etc etc …
      AND DON’T FORGET HIVE!
  • 25. Four possible market outcomes…
      § Hadoop and SQL databases are on a collision course – only one will survive
          § No sign of that so far
      § They are complementary – both will survive
          § Probably - the challenge is how to make them work together
      § They will merge and become one
          § Some indications this is already starting to happen
      § Something even more amazing will come along and replace them both
          § Sometimes this happens – Spark?
  • 26. My Personal Opinionated Opinion
      Better to use a tool that has been made for the job
      A purpose-built tool will always beat one made originally for another purpose.
  • 27. Questions?
      My contact details:
      Email: graham.mossman@exasol.com
      Twitter: @EXADude