2. I Have Big Data…
… Now what?
▪ We found a great way to handle the 3Vs (volume, velocity, variety) with Hadoop's HDFS
▪ How can we query all this data?
▪ How can we make the data accessible to people with less programming knowledge, like researchers and data scientists?
7. Hive vs. RDBMS
                   RDBMS                                Hive
Data Volume        ~10-100 GB                           ~1 TB - 1 PB
Schema             On write                             On read
Scalability        Rarely beyond 20 nodes               To hundreds of nodes
Hardware           Often built on proprietary hardware  Commodity hardware (= cheap)
Updates/Deletes    Allowed                              Allowed, but not recommended
Insertion Policy   Single/bulk inserts                  Bulk inserts
8. ACID Properties
▪ Atomicity
- Partition loads are atomic through directory renames in HDFS
▪ Consistency
- Ensured by HDFS. All nodes see the same partitions at all times
- Immutable data = no update or delete consistency issues
▪ Isolation
- Read committed, with an exception for partition deletes
- Partitions can be deleted during queries; new partitions will not be seen by jobs started before the partition add
▪ Durability
- Data is durable in HDFS before partition is exposed to Hive
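To make the atomicity point concrete, here is a minimal HiveQL sketch of a partition load (the logs table and staging path are hypothetical): exposing the partition is a single metadata operation, and the underlying file move is an HDFS rename, so readers see either the old partition list or the new one, never a half-loaded state.

-- Move staged files into the table and register the new partition in one statement;
-- the file move is an HDFS rename, which is atomic.
LOAD DATA INPATH '/staging/logs/dt=2015-06-01'
INTO TABLE logs PARTITION (dt = '2015-06-01');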
9. Hive Challenges
▪ Data growth
▪ Schema flexibility and evolution
▪ Extensibility
▪ Performance
10. Hive Features
▪ DDL - Create table (internal or external), view, index
▪ Select, where clause, group by, order by, joins, nested queries, describe, insert
▪ Complex data types
▪ Partitioning, sampling, bucketing
▪ Pluggable user defined functions: UDF, UDAF, UDTF
▪ Pluggable custom Input/Output format
▪ Pluggable SerDe libraries
▪ Integration to other services with Storage Handlers
▪ Different options for Loading Data into Hive
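To make the list above concrete, here is a minimal HiveQL sketch exercising a few of these features (external-table DDL, partitioning, and a grouped query); all table, column, and path names are made up for illustration:

-- External table over tab-delimited files already sitting in HDFS
CREATE EXTERNAL TABLE logs (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

-- Aggregate query with partition pruning on dt
SELECT dt, COUNT(*) AS hits
FROM logs
WHERE dt >= '2015-01-01'
GROUP BY dt
ORDER BY dt;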
11. File Formats
▪ Hive natively supports the TextFile, SequenceFile, RCFile, ORC and Parquet file formats
▪ Parquet is a columnar format that can improve query performance
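A minimal sketch of opting into Parquet (reusing the hypothetical logs table from the earlier example):

-- Create a Parquet-backed copy of the table; queries that touch only a few
-- columns read far less data from the columnar files.
CREATE TABLE logs_parquet STORED AS PARQUET
AS SELECT * FROM logs;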
27. Things you should know
▪ After creating a table with Hive, dropping one, performing an HDFS rebalance, or deleting data files, you must execute the following command in Impala so it recognizes the changes:
invalidate metadata <table_name>
▪ When altering a table (adding a partition, changing its location, changing permissions on files, etc.), you must refresh the Impala daemons:
refresh <table_name>
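For example, in impala-shell (the sales table name is hypothetical):

-- After a table is created or dropped on the Hive side, make the change visible to Impala:
INVALIDATE METADATA sales;

-- After new data files land in an existing table's directories,
-- reload just that table's file and block metadata:
REFRESH sales;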
28. Things you should know
▪ You can use the explain, profile and summary commands to debug a query plan or its execution
▪ Always filter by the DT partition (when it exists)
▪ For optimal performance on a table, you must compute statistics on it on a daily basis:
compute stats <table_name>
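A short impala-shell sketch of that workflow (again with a hypothetical sales table):

-- Gather table and column statistics so the planner can choose good join orders:
COMPUTE STATS sales;

-- Inspect the plan before running a query:
EXPLAIN SELECT COUNT(*) FROM sales WHERE dt = '2015-06-01';

-- After a query finishes in impala-shell, inspect where the time went:
SUMMARY;
PROFILE;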
34. Impala and the Metastore
▪ Impala uses existing Hive infrastructure – the metastore
▪ Maintains information about table definitions in the metastore
▪ Caches all table metadata to reuse for future queries
▪ Each Impala daemon contains the latest metadata