SlideShare a Scribd company logo
SQL on Hadoop
Markus Olenius
BIGDATAPUMP Ltd
• Big Data vs. Traditional Data Warehousing
• Increased volumes and scale
• Lower costs per node
• Vendor lock-in avoidance, open source preference common
• Agility over correctness of data
• Schema on read vs. schema on write
• Hadoop and MapReduce processing
– support for scale, but with limitations
• No support for BI/ETL/IDE –tools
• Missing features (optimizer, indexes, views)
• Slow due to MapReduce latency
• No schemas
• Competence gap
Background
What is SQL on Hadoop?
”A class of analytical application tools that
combine established SQL-style querying with
newer Hadoop data framework elements”
Classic Hive Architecture
Hive Server Hive MetastoreUser Client
Datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
Table to Files
Table to Format
SQL to MapReduce
compiler
Datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
Datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
Rule based
optimizer
MR Plan execution
coordinator
Catalog
Metadata
HiveQL
queries
Hive CLI
BI, ETL, IDEs
ODBC/JDBC
Hive Classic
Hive Classic
• HiveQL is a SQL-based language, runs on top of MapReduce
• Support for JDBC/ODBC
• Familiar SQL like interface
• Economical processing for petabytes
• Logical schema – schema on read
• Missing features:
• ANSI SQL
• Cost based optimizer
• Data Types
• Security
• Hive Classic tied to MapReduce processing, leading to latency overhead
 Slow but OK for batch SQL
RDBMS on Hadoop
Relying on PostgreSQL
Others
Data Virtualization
Hive and
Enhancements
SQL outside
Hadoop
Relying on HBase
MPP Query Engines
SQL on Hadoop Landscape
CitusDB
Impala
Apache Kylin
Big SQL
PolyBase
Vortex
1. MPP query engines
• Don’t utilize MapReduce or utilize it only in some specific cases
• Usually able to utilize Hive MetaStore
• Typically support for ODBC and/or JDBC
2. RDBMS on Hadoop
• RDBMS sharing nodes with Hadoop and utilizing HDFS
• Typically proprietary file format for best performance
• May require dedicated edge nodes
3. SQL outside Hadoop
• RDBMS sends request to Hadoop
• Processing work shared between Hadoop and RDBMS
• Performance depends on network and RDBMS load
4. Data Virtualization solutions
• Typically utilize Hive and possibly other MPP query engines
• Similar use cases with SQL outside Hadoop
Remarks on solution
categories
1. Minimize I/O to get needed data
• Partitioning and indexing
• Using columnar file formats
• Caching and in-memory processing
2. Query execution optimization
• Better optimization plans, cost based optimization
• Better query execution
• Combining intermediate results more efficiently
• Batch (MapReduce) vs. Streaming (Impala, HAWQ) vs. Hybrid (Spark)
approaches
Need for Speed
- methods for getting better performance
Example – Impala Architecture
File format selection
• Best performance with compressed and columnar
data
• Trade-off - time and resources used for file format
transformations
• Optimal performance and compression may
require using proprietary file formats
• File format alternatives:
• CSV, Avro, JSON
• Columnar: ORC, Parquet, RCFile
• Proprietary formats
(e.g. Vertica, JethroData)
Overview of a few
interesting tools
Hive 0.13
• from MapReduce to Tez
• Optimized for ORC file
format
• Decimal, varchar, date
• Interactive queries
Stinger.next (Hive 0.14)
• ACID transactions
• Cost based query
optimization
Open Source
Stinger InitiativeMainly supported by: Hortonworks
Stinger Initiative, Hive 0.13
Stinger.next, Hive 0.14
ImpalaMainly supported by: Cloudera
• An open source distributed SQL engine that works directly with HDFS
• Architected specifically to leverage the flexibility and scalability strengths of Hadoop
• Used query language is compatible with HiveQL and uses the same metadata store as Apache Hive
• Built in functions:
• Mathematical operations
• String manipulation
• Type conversion
• Date operations
• Available interfaces:
• A command line interface
• Hue, the Hadoop management GUI
• ODBC
• JDBC
• Supported file types:
• Text files
• Hadoop sequence files
• Avro
• Parquet
Open Source
• Operations that are not supported:
• Hive user defined functions (UDFs)
• Hive indexes
• Deleting individual rows (overwriting data in tables and
partions is possible)
Impala 2.1 and Beyond (Ships in 2015)
• Nested data – enables queries on complex nested structures including
maps, structs, and arrays (early 2015)
• MERGE statement – enables merging in updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING
SET
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into
HBase
• UDTFs (user-defined table functions) Intra-node parallelized aggregations
and joins – to provide even faster joins and aggregations on on top of the
performance gains of Impala
• Parquet enhancements – continued performance gains including index
pages
• Amazon S3 integration
SparkSQL
• “Next generation Shark”
• Able to use existing Hive metastores, SerDes, and UDFs
• Integrated APIs in Python, Scala and Java
• JDBC and ODBC connectivity
• In-memory column store data caching
• Good support for Parquet, JSON
• Open Source
Apache Drill
Mainly supported by: MapR
• Open source, low latency SQL query for Hadoop and
NoSQL
• Agile:
• Self-service data exploration capabilities on data stored in
multiple formats in files or NoSQL databases
• Metadata definitions in a centralized store are not required
• Flexible:
• Hierarchical columnar representation of data allows high
performance queries
• Conforms to the ANSI SQL standards
• ODBC connector is used when integrated with BI tools
• Open Source
HP Vertica
• Enterprise-ready SQL queries on Hadoop data
• Features included:
• Database designer
• Management console
• Workload management
• Flex tables
• External tables
• Backup functionality
• Grid-based, columnar DBMS for data warehousing
• Handles changing workloads as elastically as the cloud
• Replication, failover and recovery in the cloud is provided
• YARN not yet supported
• Data Model: Relational structured
• Transaction Model: ACID
• Commercial: Pay per term, per node
• Features that are not included:
• Geospatial functions
• Live aggregate projections
• Time series analytics
• Text search
• Advanced analytics packages
JethroData
• Index based, all data is indexed
• Columnar proprietary file format
• Supports use with AWS S3 data
• Running on dedicated edge nodes besides Hadoop
cluster
• Supports use with Qlik, Tableau or MicroStrategy
through ODBC/JDBC
• Commercial
Demo with Qlik:
http://jethrodata.qlik.com/hub/stream/aaec8d41-5201-43ab-809f-3063750dfafd
Summary
Tool selection based on
use case requirements
https://www.mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details
Tool Focus
http://www.datasalt.com/2014/04/sql-on-hadoop-state-of-the-art/
https://www.mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details
• Long list of alternatives to choose from
• Many tools claim to be fastest in market (self-done evaluation tests)
• SQL and data type support varies
 It really depends on what your requirements for this capability are – at
least for batch and interactive SQL use cases there are many tools that
are probably “good enough”
 Doing testing and evaluation with actual use cases is typically needed
 Start with tools recommended or certified by selected Hadoop distribution
 Consider also amount of concurrent use, not only single query
performance
 SQL-on-Hadoop capabilities are usually not a differentiator that should
guide selection of Hadoop distribution
How to select correct tool for your
SQL-on-Hadoop use cases?
BIGDATAPUMP LTD
WWW.BIGDATAPUMP.COM

More Related Content

What's hot

JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
JethroData
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
Scott Leberknight
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
IBM Analytics
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
Cloudera, Inc.
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
DataWorks Summit
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
Data Science London
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_Opportunity
Nojan Emad
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
Cloudera, Inc.
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
Remy Rosenbaum
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
markgrover
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
Shivaji Dutta
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 

What's hot (20)

JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Azure_Business_Opportunity
Azure_Business_OpportunityAzure_Business_Opportunity
Azure_Business_Opportunity
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 

Viewers also liked

Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
nvvrajesh
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Arseny Chernov
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
Gruter
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
Radenko Zec
 
map.geo.admin.ch en 3D
map.geo.admin.ch en 3Dmap.geo.admin.ch en 3D
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
Luke Han
 
Pitch for ESA's Digital Agenda
Pitch for ESA's Digital Agenda Pitch for ESA's Digital Agenda
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Pdf 이교수의 멘붕하둡_pig
Pdf 이교수의 멘붕하둡_pigPdf 이교수의 멘붕하둡_pig
Pdf 이교수의 멘붕하둡_pig
Michelle Hong
 
인메모리 클러스터링 아키텍처
인메모리 클러스터링 아키텍처인메모리 클러스터링 아키텍처
인메모리 클러스터링 아키텍처
Jaehong Cheon
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
Nicolas Morales
 
머신 러닝(Machine Learning)
머신 러닝(Machine Learning)머신 러닝(Machine Learning)
머신 러닝(Machine Learning)
BoYoung Lee
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Matthew (정재화)
 
Ai(인공지능) & ML(머신러닝) 101 Part1
Ai(인공지능) & ML(머신러닝) 101 Part1Ai(인공지능) & ML(머신러닝) 101 Part1
Ai(인공지능) & ML(머신러닝) 101 Part1
Donghan Kim
 
20160409 microsoft 세미나 머신러닝관련 발표자료
20160409 microsoft 세미나 머신러닝관련 발표자료20160409 microsoft 세미나 머신러닝관련 발표자료
20160409 microsoft 세미나 머신러닝관련 발표자료
JungGeun Lee
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기
Ted Won
 
인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝
Jinwon Lee
 

Viewers also liked (20)

Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
map.geo.admin.ch en 3D
map.geo.admin.ch en 3Dmap.geo.admin.ch en 3D
map.geo.admin.ch en 3D
 
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
 
Pitch for ESA's Digital Agenda
Pitch for ESA's Digital Agenda Pitch for ESA's Digital Agenda
Pitch for ESA's Digital Agenda
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Pdf 이교수의 멘붕하둡_pig
Pdf 이교수의 멘붕하둡_pigPdf 이교수의 멘붕하둡_pig
Pdf 이교수의 멘붕하둡_pig
 
인메모리 클러스터링 아키텍처
인메모리 클러스터링 아키텍처인메모리 클러스터링 아키텍처
인메모리 클러스터링 아키텍처
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
머신 러닝(Machine Learning)
머신 러닝(Machine Learning)머신 러닝(Machine Learning)
머신 러닝(Machine Learning)
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
 
Ai(인공지능) & ML(머신러닝) 101 Part1
Ai(인공지능) & ML(머신러닝) 101 Part1Ai(인공지능) & ML(머신러닝) 101 Part1
Ai(인공지능) & ML(머신러닝) 101 Part1
 
20160409 microsoft 세미나 머신러닝관련 발표자료
20160409 microsoft 세미나 머신러닝관련 발표자료20160409 microsoft 세미나 머신러닝관련 발표자료
20160409 microsoft 세미나 머신러닝관련 발표자료
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기
 
인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝인공지능, 기계학습 그리고 딥러닝
인공지능, 기계학습 그리고 딥러닝
 
Xerox certification
Xerox certificationXerox certification
Xerox certification
 
Presentation1
Presentation1Presentation1
Presentation1
 

Similar to SQL on Hadoop

Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
Serendio Inc.
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
DataWorks Summit
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
vishwasgarade1
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
Some corner at the Laboratory
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 

Similar to SQL on Hadoop (20)

Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache drill
Apache drillApache drill
Apache drill
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 

Recently uploaded

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 

Recently uploaded (20)

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 

SQL on Hadoop

  • 1. SQL on Hadoop Markus Olenius BIGDATAPUMP Ltd
  • 2. • Big Data vs. Traditional Data Warehousing • Increased volumes and scale • Lower costs per node • Vendor lock-in avoidance, open source preference common • Agility over correctness of data • Schema on read vs. schema on write • Hadoop and MapReduce processing – support for scale, but with limitations • No support for BI/ETL/IDE –tools • Missing features (optimizer, indexes, views) • Slow due to MapReduce latency • No schemas • Competence gap Background
  • 3. What is SQL on Hadoop? ”A class of analytical application tools that combine established SQL-style querying with newer Hadoop data framework elements”
  • 4. Classic Hive Architecture Hive Server Hive MetastoreUser Client Datanode HDFS Map/Reduce Hive Operators Hive SerDes Table to Files Table to Format SQL to MapReduce compiler Datanode HDFS Map/Reduce Hive Operators Hive SerDes Datanode HDFS Map/Reduce Hive Operators Hive SerDes Rule based optimizer MR Plan execution coordinator Catalog Metadata HiveQL queries Hive CLI BI, ETL, IDEs ODBC/JDBC
  • 5. Hive Classic Hive Classic • HiveQL is a SQL-based language, runs on top of MapReduce • Support for JDBC/ODBC • Familiar SQL like interface • Economical processing for petabytes • Logical schema – schema on read • Missing features: • ANSI SQL • Cost based optimizer • Data Types • Security • Hive Classic tied to MapReduce processing, leading to latency overhead  Slow but OK for batch SQL
  • 6. RDBMS on Hadoop Relying on PostgreSQL Others Data Virtualization Hive and Enhancements SQL outside Hadoop Relying on HBase MPP Query Engines SQL on Hadoop Landscape CitusDB Impala Apache Kylin Big SQL PolyBase Vortex
  • 7. 1. MPP query engines • Don’t utilize MapReduce or utilize it only in some specific cases • Usually able to utilize Hive MetaStore • Typically support for ODBC and/or JDBC 2. RDBMS on Hadoop • RDBMS sharing nodes with Hadoop and utilizing HDFS • Typically proprietary file format for best performance • May require dedicated edge nodes 3. SQL outside Hadoop • RDBMS sends request to Hadoop • Processing work shared between Hadoop and RDBMS • Performance depends on network and RDBMS load 4. Data Virtualization solutions • Typically utilize Hive and possibly other MPP query engines • Similar use cases with SQL outside Hadoop Remarks on solution categories
  • 8. 1. Minimize I/O to get needed data • Partitioning and indexing • Using columnar file formats • Caching and in-memory processing 2. Query execution optimization • Better optimization plans, cost based optimization • Better query execution • Combining intermediate results more efficiently • Batch (MapReduce) vs. Streaming (Impala, HAWQ) vs. Hybrid (Spark) approaches Need for Speed - methods for getting better performance
  • 9. Example – Impala Architecture
  • 10. File format selection • Best performance with compressed and columnar data • Trade-off - time and resources used for file format transformations • Optimal performance and compression may require using proprietary file formats • File format alternatives: • CSV, Avro, JSON • Columnar: ORC, Parquet, RCFile • Proprietary formats (e.g. Vertica, JethroData)
  • 11. Overview of a few interesting tools
  • 12. Hive 0.13 • from MapReduce to Tez • Optimized for ORC file format • Decimal, varchar, date • Interactive queries Stinger.next (Hive 0.14) • ACID transactions • Cost based query optimization Open Source Stinger InitiativeMainly supported by: Hortonworks
  • 15. ImpalaMainly supported by: Cloudera • An open source distributed SQL engine that works directly with HDFS • Architected specifically to leverage the flexibility and scalability strengths of Hadoop • Used query language is compatible with HiveQL and uses the same metadata store as Apache Hive • Built in functions: • Mathematical operations • String manipulation • Type conversion • Date operations • Available interfaces: • A command line interface • Hue, the Hadoop management GUI • ODBC • JDBC • Supported file types: • Text files • Hadoop sequence files • Avro • Parquet Open Source • Operations that are not supported: • Hive user defined functions (UDFs) • Hive indexes • Deleting individual rows (overwriting data in tables and partions is possible) Impala 2.1 and Beyond (Ships in 2015) • Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015) • MERGE statement – enables merging in updates into existing tables • Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SET • SQL SET operators – MINUS, INTERSECT • Apache HBase CRUD – allows use of Impala for inserts and updates into HBase • UDTFs (user-defined table functions) Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on on top of the performance gains of Impala • Parquet enhancements – continued performance gains including index pages • Amazon S3 integration
  • 16. SparkSQL • “Next generation Shark” • Able to use existing Hive metastores, SerDes, and UDFs • Integrated APIs in Python, Scala and Java • JDBC and ODBC connectivity • In-memory column store data caching • Good support for Parquet, JSON • Open Source
  • 17. Apache Drill Mainly supported by: MapR • Open source, low latency SQL query for Hadoop and NoSQL • Agile: • Self-service data exploration capabilities on data stored in multiple formats in files or NoSQL databases • Metadata definitions in a centralized store are not required • Flexible: • Hierarchical columnar representation of data allows high performance queries • Conforms to the ANSI SQL standards • ODBC connector is used when integrated with BI tools • Open Source
  • 18. HP Vertica • Enterprise-ready SQL queries on Hadoop data • Features included: • Database designer • Management console • Workload management • Flex tables • External tables • Backup functionality • Grid-based, columnar DBMS for data warehousing • Handles changing workloads as elastically as the cloud • Replication, failover and recovery in the cloud is provided • YARN not yet supported • Data Model: Relational structured • Transaction Model: ACID • Commercial: Pay per term, per node • Features that are not included: • Geospatial functions • Live aggregate projections • Time series analytics • Text search • Advanced analytics packages
  • 19. JethroData • Index based, all data is indexed • Columnar proprietary file format • Supports use with AWS S3 data • Running on dedicated edge nodes besides Hadoop cluster • Supports use with Qlik, Tableau or MicroStrategy through ODBC/JDBC • Commercial Demo with Qlik: http://jethrodata.qlik.com/hub/stream/aaec8d41-5201-43ab-809f-3063750dfafd
  • 21. Tool selection based on use case requirements https://www.mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details
  • 23. • Long list of alternatives to choose from • Many tools claim to be fastest in market (self-done evaluation tests) • SQL and data type support varies  It really depends on what your requirements for this capability are – at least for batch and interactive SQL use cases there are many tools that are probably “good enough”  Doing testing and evaluation with actual use cases is typically needed  Start with tools recommended or certified by selected Hadoop distribution  Consider also amount of concurrent use, not only single query performance  SQL-on-Hadoop capabilities are usually not a differentiator that should guide selection of Hadoop distribution How to select correct tool for your SQL-on-Hadoop use cases?