SQL on Hadoop
by Doron Vainrub
I Have Big Data…
… Now what?
▪ We found a great way to handle the 3Vs (volume, velocity, variety) with Hadoop’s HDFS
▪ How can we query all this data?
▪ How can we make the data accessible to people with less programming knowledge, such as researchers and data scientists?
< 2 >SQL on Hadoop
Example using Hue
Hive vs. RDBMS
                   RDBMS                                 Hive
Data Volume        ~ 10-100 GB                           ~ 1 TB - 1 PB
Schema             On write                              On read
Scalability        Rarely beyond 20 nodes                To hundreds of nodes
Hardware           Often built on proprietary hardware   Commodity hardware (= cheap)
Updates/Deletes    Allowed                               Allowed, but not recommended
Insertion Policy   Single/bulk inserts                   Bulk inserts
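The "schema on read" row can be illustrated with a minimal Python sketch. It is not Hive code: the raw data, the column names and the `read_with_schema` helper are all hypothetical. The point is only that the file is stored without validation and the schema is applied when a query reads it, so malformed fields surface as missing values at read time rather than as rejected writes.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no validation at write time.
RAW = "1,alice,2015-01-01\n2,bob,not-a-date\n3,carol\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("dt", str)]


def read_with_schema(raw, schema):
    """Project each raw line onto the schema; missing or unparsable fields become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

With schema on write, the third line would typically have been rejected at insert time; here it is accepted and its missing `dt` field only shows up when the data is read.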
ACID Properties
▪ Atomicity
- Partition loads are atomic through directory renames in HDFS
▪ Consistency
- Ensured by HDFS. All nodes see the same partitions at all times
- Immutable data = no update or delete consistency issues
▪ Isolation
- Read committed with an exception for partition deletes
- Partitions can be deleted during queries. New partitions will not be seen by jobs started before the partition was added
▪ Durability
- Data is durable in HDFS before partition is exposed to Hive
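The atomicity and durability points above can be sketched in Python, with the local filesystem standing in for HDFS (an assumption; a real load would go through the HDFS client). The function names and the `_staging_` prefix are hypothetical. Files are first written to a staging directory, then published with a single rename, so a reader listing the table sees either no partition or the complete one.

```python
import os
import tempfile


def load_partition(table_dir, partition, files):
    """Write files into a hidden staging directory, then publish it with one rename."""
    staging = os.path.join(table_dir, "_staging_" + partition)
    final = os.path.join(table_dir, partition)
    os.makedirs(staging)
    for name, content in files.items():
        with open(os.path.join(staging, name), "w") as fh:
            fh.write(content)
    # The data is fully written before this point; the rename only exposes it.
    os.rename(staging, final)
    return final


def visible_partitions(table_dir):
    """List partitions, skipping staging directories (names starting with '_')."""
    return sorted(d for d in os.listdir(table_dir) if not d.startswith("_"))


table_dir = tempfile.mkdtemp()
load_partition(table_dir, "dt=2015-01-01", {"part-00000": "1,alice\n"})
partitions = visible_partitions(table_dir)
```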
Hive Challenges
▪ Data growth
▪ Schema flexibility and evolution
▪ Extensibility
▪ Performance
Hive Features
▪ DDL - Create table (internal or external), view, index
▪ Select, where clause, group by, order by, joins, nested queries, describe, insert
▪ Complex data types
▪ Partitioning, sampling, bucketing
▪ Pluggable user defined functions: UDF, UDAF, UDTF
▪ Pluggable custom Input/Output format
▪ Pluggable SerDe libraries
▪ Integration to other services with Storage Handlers
▪ Different options for Loading Data into Hive
File Formats
▪ Hive natively supports the TextFile, SequenceFile, RCFile, ORC and Parquet file formats
▪ Parquet is a columnar format that can improve query performance
Join in Hive with MapReduce
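The slide's diagram is not preserved in the text, so the following is only a sketch of one common way a join maps onto MapReduce: a reduce-side join. It is written in plain Python rather than as real MapReduce code, and the tables, function names and data are hypothetical. Mappers tag each row with its source table and emit it under the join key, the shuffle groups rows by key, and the reducer pairs rows from the two tables.

```python
from collections import defaultdict


def map_phase(table_name, rows, key_index):
    """Emit (join_key, (table_tag, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key_index], (table_name, row)


def shuffle(pairs):
    """Group all emitted values by key, standing in for the shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, left, right):
    """For each key, pair every left-table row with every right-table row (inner join)."""
    out = []
    for key in sorted(groups):
        lefts = [row for tag, row in groups[key] if tag == left]
        rights = [row for tag, row in groups[key] if tag == right]
        for l_row in lefts:
            for r_row in rights:
                out.append(l_row + r_row)
    return out


# Hypothetical tables: users(id, name) and orders(order_id, user_id, amount).
users = [(1, "alice"), (2, "bob")]
orders = [(10, 1, 9.5), (11, 1, 3.0), (12, 3, 7.0)]

pairs = list(map_phase("users", users, 0)) + list(map_phase("orders", orders, 1))
joined = reduce_phase(shuffle(pairs), "users", "orders")
```

Because every row of both tables passes through the shuffle, this style of join is expensive on large inputs, which is one reason partition filters and statistics matter for query planning.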
Hive Architecture
Query Example
Things you should know
▪ After creating or dropping a table with Hive, running an HDFS rebalance, or deleting data files, you must execute the following command in Impala so that it recognizes the changes:
invalidate metadata <table_name>
▪ After altering a table (adding a partition, changing its location, changing permissions on files, etc.), you must refresh the Impala Daemons:
refresh <table_name>
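Why these two commands are needed can be pictured with a toy model of a metadata cache. This is a simplified Python sketch, not Impala's implementation: the `MetadataCache` class, its methods and the table names are hypothetical. The cache is a snapshot, so changes made outside Impala stay invisible until the cached entry is reloaded or discarded.

```python
class MetadataCache:
    """Toy stand-in for a daemon's cached copy of the metastore."""

    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {name: list(parts) for name, parts in metastore.items()}
        self.stale = set()

    def partitions(self, table):
        if table in self.stale:
            # Invalidated entries are reloaded on the next access.
            self.cache[table] = list(self.metastore[table])
            self.stale.discard(table)
        # Raises KeyError for tables the cache has never heard of.
        return self.cache[table]

    def refresh(self, table):
        # Reload an already-known table in place.
        self.cache[table] = list(self.metastore[table])

    def invalidate_metadata(self, table):
        # Discard whatever is cached and mark the table for reloading.
        self.cache.pop(table, None)
        self.stale.add(table)


metastore = {"logs": ["dt=2015-01-01"]}
cache = MetadataCache(metastore)

# A partition is added outside the cache: the cached view is stale until refreshed.
metastore["logs"].append("dt=2015-01-02")
stale_view = list(cache.partitions("logs"))
cache.refresh("logs")
fresh_view = list(cache.partitions("logs"))

# A table is created outside the cache: it is unknown until invalidated.
metastore["events"] = ["dt=2015-01-01"]
try:
    cache.partitions("events")
    seen_before = True
except KeyError:
    seen_before = False
cache.invalidate_metadata("events")
events_view = cache.partitions("events")
```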
Things you should know
▪ You can use the explain, profile and summary commands to debug a query plan or its execution
▪ Always filter by the DT partition (when it exists)
▪ For optimal performance on a table, compute statistics on it daily:
compute stats <table_name>
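The advice to filter by the DT partition can be illustrated with a small Python sketch of partition pruning. The table layout, the `scan` function and the data are hypothetical; the sketch assumes one directory per `dt` value, so a filter on `dt` lets the scan skip whole directories instead of reading every row and discarding most of them.

```python
# Hypothetical table layout: one directory per dt partition, each holding rows.
TABLE = {
    "dt=2015-01-01": [("alice", 3), ("bob", 5)],
    "dt=2015-01-02": [("alice", 1)],
    "dt=2015-01-03": [("carol", 7), ("bob", 2)],
}


def scan(table, dt=None):
    """Return (rows, partitions_read); a dt filter skips non-matching partitions entirely."""
    rows, partitions_read = [], 0
    for partition, data in table.items():
        if dt is not None and partition != "dt=" + dt:
            continue  # pruned: this directory is never opened
        partitions_read += 1
        rows.extend(data)
    return rows, partitions_read


all_rows, full_cost = scan(TABLE)
day_rows, pruned_cost = scan(TABLE, dt="2015-01-02")
```

In this toy table the filtered scan opens one partition instead of three; on a table with years of daily partitions the difference is proportionally larger.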
Impala Architecture
Impala’s service consists of the following components:
▪ Impala Daemon
▪ Statestore
▪ Catalog Server
Impala and the Metastore
▪ Impala uses existing Hive infrastructure – the metastore
▪ Maintains information about table definitions in the metastore
▪ Caches all table metadata to reuse for future queries
▪ Each Impala Daemon contains the latest metadata
Query Execution
[Diagram: a UI client and three Impala Daemons; each daemon contains a Query Planner, a Query Coordinator and a Query Executor, and sits on top of HDFS]
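The roles named in the diagram can be sketched as a scatter-gather flow in Python. This is a simplified illustration under stated assumptions, not Impala's actual protocol: the daemon names, the data and the functions `plan`, `execute` and `coordinate` are hypothetical. The daemon that receives the query plans it, fans fragments out to the executors, and merges their partial results.

```python
def plan(query_key, daemons):
    """Planner: one fragment per daemon, each scanning that daemon's local data."""
    return [(name, query_key) for name in daemons]


def execute(fragment, daemons):
    """Executor: partial count over the rows stored locally on one daemon."""
    name, key = fragment
    return sum(1 for row in daemons[name] if row == key)


def coordinate(query_key, daemons):
    """Coordinator: plan the query, fan out fragments, merge the partial results."""
    fragments = plan(query_key, daemons)
    partials = [execute(fragment, daemons) for fragment in fragments]
    return sum(partials)


# Hypothetical local data held by three daemons.
DAEMONS = {
    "impalad-1": ["a", "b", "a"],
    "impalad-2": ["b"],
    "impalad-3": ["a", "c"],
}

# Roughly "SELECT COUNT(*) WHERE col = 'a'", answered from per-daemon partial counts.
total = coordinate("a", DAEMONS)
```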