Hadoop databases:
Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh
April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE ACE alumnus, Amazon alumnus
• Last year: OLTP -> Hadoop
Shameless plug about my company
Gluent
[Diagram: sources Oracle, Teradata, MSSQL, NoSQL and Big Data Sources; applications App X, App Y, App Z]
Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start
What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSql
Yes, but what does it all mean ?
Imagine that you are Google
in the early 2000s
Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
Let’s build a Data Warehouse
(traditional) Data warehouse
• Been there for years
• Mature and (relatively) advanced
• SQL !!!
Data Warehouse scorecard
Requirements RDBMS
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data ¯_(ツ)_/¯
Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand
(cheap) Commodity systems make “big data” feasible
Solution = commodity systems
=
$$$$$ $$
Commodity systems scorecard
Requirements Commodity
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data   
All your queries are Java Classes
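To make that pain concrete, here is roughly what a simple "count rows per prod_id" query turns into when it is written by hand in the MapReduce style. This is a minimal sketch in plain Python rather than Java, with made-up sample rows; `map_phase`, `shuffle`, `reduce_phase` and `count_by_product` are illustrative names, not Hadoop APIs.

```python
from collections import defaultdict

# Hypothetical sample of sales rows: (prod_id, channel_id)
SALES = [(43, 5), (46, 5), (43, 5), (22, 3), (43, 3)]


def map_phase(rows):
    """Mapper: emit one (key, 1) pair per input row."""
    for prod_id, _channel_id in rows:
        yield prod_id, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (the framework does this part in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_product(rows):
    """Equivalent of: select prod_id, count(1) from sales group by prod_id."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_product(SALES)` returns a dictionary of counts keyed by `prod_id`. One line of SQL becomes three hand-written phases, which is the "ease of use" gap the scorecard is pointing at.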
Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper
Hadoop
• 2006: Hadoop
“Traditional Data Warehouse” vs. Hadoop
Requirements Hadoop Data Warehouse
(reasonably) Fast      
(reasonably) Cheap      
(reasonably) Easy to use      
Able to process data    ¯_(ツ)_/¯
SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!
Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!
And then, it exploded …
“Hadoop” vs “Relational” databases
Demo … hopefully
This is not about NoSql :-)
Same: Tables
sql> describe sh.products;
+-----------------------+----------------+---------+
| name                  | type           | comment |
+-----------------------+----------------+---------+
| prod_id               | bigint         |         |
| prod_name             | string         |         |
| prod_desc             | string         |         |
| prod_category_id      | bigint         |         |
| prod_category_desc    | string         |         |
| supplier_id           | bigint         |         |
| prod_total_id         | decimal(38,18) |         |
| prod_src_id           | decimal(38,18) |         |
| prod_eff_from         | timestamp      |         |
| prod_eff_to           | timestamp      |         |
| prod_valid            | string         |         |
+-----------------------+----------------+---------+
Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id
and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s
Same: Queries are optimized
sql> explain select count(1) from sh.times;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [FINALIZE]                                  |
| |  output: count:merge(1)                                |
| |                                                        |
| 02:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(1)                                      |
| |                                                        |
| 00:SCAN HDFS [sh.times]                                  |
|    partitions=16/16 files=32 size=500.45KB               |
+----------------------------------------------------------+
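The plan reads bottom-up: each host scans and counts its own slice, the partial counts are shipped to a single node, and that node merges them. The sketch below mimics those four steps in plain Python. It assumes a hypothetical three-host split of `sh.times` with made-up rows, and the function names are illustrative, not Impala internals.

```python
# Hypothetical split of sh.times across three hosts (row contents do not matter for a count).
HOST_FRAGMENTS = {
    "host_a": ["2011-01-01", "2011-01-02"],
    "host_b": ["2011-01-03"],
    "host_c": ["2011-01-04", "2011-01-05", "2011-01-06"],
}


def scan(host):
    """00:SCAN HDFS - each host reads only its local slice of the table."""
    return HOST_FRAGMENTS[host]


def partial_aggregate(rows):
    """01:AGGREGATE - count(1) computed locally on each host."""
    return sum(1 for _ in rows)


def exchange(partials):
    """02:EXCHANGE [UNPARTITIONED] - ship every partial result to one node."""
    return list(partials)


def final_aggregate(partials):
    """03:AGGREGATE [FINALIZE] - count:merge(1) adds up the partial counts."""
    return sum(partials)


def count_times():
    """Equivalent of: select count(1) from sh.times, run as a two-phase aggregate."""
    partials = (partial_aggregate(scan(host)) for host in HOST_FRAGMENTS)
    return final_aggregate(exchange(partials))
```

Only one small number per host crosses the network, which is why this shape of plan scales with the number of hosts.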
Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: table data split into slices SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3]
Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
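Because the partition key is encoded in the directory name, an engine can decide which partitions to read (and which filesystem each one lives on) from the paths alone. A minimal sketch of that idea in plain Python, using a subset of the paths above; `partition_value`, `prune` and `storage_of` are hypothetical helpers, not functions from Hive or Impala.

```python
# Partition locations as listed on the slide (a subset).
PARTITIONS = [
    "s3a://bucket1/sh/sales/time_id=2011-07",
    "s3a://bucket1/sh/sales/time_id=2011-08",
    "hdfs://clust1/sh/sales/time_id=2011-09",
    "hdfs://clust1/sh/sales/time_id=2011-10",
]


def partition_value(path, column="time_id"):
    """Read the partition key value out of the last 'column=value' path segment."""
    name, _, value = path.rsplit("/", 1)[-1].partition("=")
    if name != column:
        raise ValueError(f"{path!r} is not partitioned by {column}")
    return value


def prune(paths, low, high):
    """Keep only partitions whose time_id falls in [low, high]; the rest are never read."""
    return [p for p in paths if low <= partition_value(p) <= high]


def storage_of(path):
    """The URI scheme tells the engine which filesystem holds the partition."""
    return path.split("://", 1)[0]
```

For example, `prune(PARTITIONS, "2011-08", "2011-09")` keeps one S3 partition and one HDFS partition, so a single query can span both stores. Comparing the values as strings works here only because the `YYYY-MM` format sorts chronologically.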
Database engine does NOT “own” data
Different: Different engines can work with the same data files (even at the same time)
[Diagram: Oracle data files (example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf) next to Parquet data files (a01_data.parq, a02_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq)]
Different: … or copies of the data files
hdfs://adhoc/a.parq
hdfs://adhoc/b.parq
hdfs://adhoc/c.parq
hdfs://adhoc/d.parq
hdfs://adhoc/e.parq
hdfs://adhoc/f.parq
hdfs://prod/a.parq
hdfs://prod/b.parq
hdfs://prod/c.parq
hdfs://prod/d.parq
hdfs://prod/e.parq
hdfs://prod/f.parq
s3://backup/a.parq
s3://backup/b.parq
s3://backup/c.parq
s3://backup/d.parq
s3://backup/e.parq
s3://backup/f.parq
Different: Open data formats
• Not proprietary – many tools can read/write
• No additional $$ for “advanced features”:
• Columnar storage
• Storage indexes
• Compression
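For readers used to paying extra for these features, the sketch below shows what columnar storage and a storage index amount to, in plain Python. It assumes hypothetical row groups with made-up values. Real formats such as Parquet and ORC keep the min/max statistics in file metadata, whereas here they are computed on the fly to keep the example short.

```python
# Hypothetical row groups of a columnar file: values are stored column by column.
ROW_GROUPS = [
    {"prod_id": [1, 3, 7], "amount": [10.0, 12.5, 8.0]},
    {"prod_id": [22, 25, 31], "amount": [5.0, 7.5, 9.0]},
    {"prod_id": [43, 43, 46], "amount": [20.0, 21.0, 3.0]},
]


def column_stats(group, column):
    """Min/max of one column in one row group (precomputed metadata in real formats)."""
    values = group[column]
    return min(values), max(values)


def sum_amount_for(prod_id, groups):
    """Sum 'amount' for one prod_id, skipping row groups whose min/max rule it out."""
    total, groups_read = 0.0, 0
    for group in groups:
        low, high = column_stats(group, "prod_id")
        if not low <= prod_id <= high:
            continue  # the "storage index": this row group cannot contain the value
        groups_read += 1
        # Columnar layout: only the two columns the query needs are touched.
        for pid, amount in zip(group["prod_id"], group["amount"]):
            if pid == prod_id:
                total += amount
    return total, groups_read
```

`sum_amount_for(43, ROW_GROUPS)` reads a single row group out of three. There is no index structure to maintain, only statistics that let the reader skip data.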
Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
Different: External dictionary
User data
Dictionary (SYS)
User data
Dictionary (SYS)
Hive Metastore
Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
[Diagram: Table T (base): a_data.orc -> Table T (base): a_data.orc + Table T (delta): b_data.orc -> Compactor runs … -> Table T (base): c_data.orc]
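The base/delta/compactor sequence can be sketched in a few lines of plain Python. This is an illustration of the idea only, with made-up rows and a hypothetical `Table` class. The file names follow the slide, and a real engine's on-disk layout and transaction handling are considerably more involved.

```python
class Table:
    """Toy table made of immutable files: one base file plus zero or more delta files."""

    def __init__(self, rows):
        self.base = ("a_data.orc", dict(rows))  # (file name, {row_id: row})
        self.deltas = []

    def files(self):
        return [self.base[0]] + [name for name, _ in self.deltas]

    def read(self):
        """Readers merge base and deltas on the fly; later deltas win."""
        merged = dict(self.base[1])
        for _name, rows in self.deltas:
            merged.update(rows)
        return merged

    def update(self, set_a, where_b):
        """UPDATE t SET a=<set_a> WHERE b=<where_b>: the base file is never rewritten."""
        changed = {
            row_id: {**row, "a": set_a}
            for row_id, row in self.read().items()
            if row["b"] == where_b
        }
        self.deltas.append(("b_data.orc", changed))

    def compact(self):
        """Compactor: fold base + deltas into a new base file and drop the deltas."""
        self.base = ("c_data.orc", self.read())
        self.deltas = []


t = Table({1: {"a": 5, "b": 1}, 2: {"a": 7, "b": 2}})
t.update(set_a=12, where_b=1)   # files: a_data.orc + b_data.orc
t.compact()                     # files: c_data.orc only
```

Queries see the same rows before and after `compact()`; only the set of files changes. That is why frequent small updates are costly and why the DML feels like ETL.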
Databases
Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, Tez, Spark
• SerDes
[Diagram: Master (Hiveserver2, namenode), YARN RM, and three Slave nodes, each with YARN NM and datanode]
Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master (statestored, namenode, catalogd); Slave A, Slave B, Slave C, each with impalad and datanode]
Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many clusters: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master (Driver); Slave A, Slave B, Slave C, each with an Executor]
Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, …
[Diagram: Master (coordinator); Slave A, Slave B, Slave C, each with a worker]
How to start
Step 1: Google “Hadoop ecosystem”
Step 2: Try to install the simplest thing
Step 3
Step 4
Hint: Nobody builds their own Linux anymore
Choose a Hadoop distribution that suits you
Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own “philosophy” …
• … as well as a preferred Hadoop database
So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future-oriented
Q&A


Editor's Notes

  • #16 All your queries are essentially Java Classes
  • #18 With Apache Hadoop, “everybody” can run “big data” queries
  • #19 With Apache Hadoop, “everybody” can run “big data” queries
  • #26 Columns and rows; schemas; similar data types. Other familiar objects: views (*), functions (**). Notably missing: indexes.
  • #27 Joins, subqueries, aggregate functions. Optimizer, statistics. Different SQL dialects.
  • #28 Joins, subqueries, aggregate functions. Optimizer, statistics. Different SQL dialects.
  • #34 Different databases support different formats. Some (e.g. Hive) support “hookups” for custom formats.
  • #36 Some dictionary information (e.g. partitions) can be read directly from the file system.
  • #37 Each database supports different DML semantics. Apache Kudu is coming to change that.
  • #39 Java. Old and trusty.
  • #40 Not Hadoop. C++.
  • #41 Better Hadoop. Different apps use different executors.
  • #42 Does not spill to disk, not as stable as Hive. https://www.quora.com/What-are-the-main-differences-between-Hive-and-Facebook-Presto
  • #44 Technically, not Hadoop. Different apps use different executors.
  • #45 Technically, not Hadoop. Different apps use different executors.
  • #46 Technically, not Hadoop. Different apps use different executors.
  • #47 Technically, not Hadoop. Different apps use different executors.