© 2014 IBM Corporation
Information Management
Evaluating SQL-on-Hadoop
Performance and Compatibility
IBM Big SQL Hadoop-DS Benchmark
Last revised: Oct 26, 2014
Agenda
About Big SQL
The TPC-DS™ Benchmark
The Hadoop-DS Benchmark
Big SQL performance
▪ 30TB Hadoop-DS result with IBM Big SQL
▪ 10TB Hadoop-DS comparison with Cloudera Impala™ and Hortonworks® Hive
Conclusions
Additional detail
TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
Hortonworks is a trademark of Hortonworks Inc.
Other company, product, or service names may be trademarks or service marks of others.
The case for SQL on Hadoop
SQL has become ubiquitous in today’s data center
Customers have large existing investments
▪ Skills, commercial & in-house applications
70% of Big Data initiatives involve transactional data¹
▪ Transactional big data well suited to SQL
Standardization & compatibility are essential
▪ Customers modernizing warehouse environments cannot afford separate SQL dialects and tools for different data sources
1. 70% of 465 survey respondents cite transactional data as a primary target for big data initiatives - Gartner research note “Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype” Sept 12 2013 Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk
IBM InfoSphere BigInsights - Big SQL
Big SQL = Big Investment Protection
Rich ANSI SQL support
Outstanding performance
Native Hadoop data sources
Federation: multiple data sources
Extensive analytic functions
Security built-in
[Architecture diagram: SQL-based application, Big SQL, optimized SQL MPP run-time, and native Hadoop data sources (CSV, SEQ, Parquet, RC, AVRO, ORC, JSON, custom)]
IBM invented SQL and has over thirty
years of experience engineering
advanced SQL query engines
IBM InfoSphere BigInsights - Big SQL
Application Portability & Integration
▪ Native Hadoop Data
▪ Comprehensive file formats
Performance
▪ Powerful query re-writer
▪ Cost-based optimizer
▪ Sophisticated memory management
Federation
▪ Single SQL statement
▪ Multiple data sources
▪ DB2, Oracle, Teradata & more
Enterprise Features
▪ Security & Auditing
▪ Self-tuning & management
▪ Comprehensive monitoring
Rich SQL Support
▪ ANSI Compliant
▪ IBM PL SQL Compatible
▪ Extensive Analytic Functions
Big SQL = Big Investment Protection
About the TPC-DS Benchmark – www.tpc.org
▪ Models a hypothetical retail operation
▪ Realistic multi-domain data warehouse environment
▪ Retail sales, web, catalog data, inventory, demographics & promotions
▪ Models several aspects of business operations
▪ Queries, concurrency, data loading, data maintenance
▪ Designed for relational data warehouse product offerings
▪ Four broad types of queries:
▪ Reporting queries, ad-hoc queries, iterative OLAP queries, data mining queries – 99 queries in all
▪ Designed for multiple scale factors
▪ 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
▪ Designed for multi-user concurrency
▪ Minimum of 4 concurrent users running all 99 queries
▪ No vendor has ever published a formal TPC-DS benchmark
Beware of Cherry Pickers & Benchmarketing!
TPC-DS has strict requirements
▪ All 99 queries need to be run in their entirety
▪ Each query is unique and tests a different facet of the environment
▪ Answer set correctness must be proven
▪ Result must be audited
As a result, it is not valid to:
▪ Select individual queries
▪ Change queries outside of prescribed guidelines (“minor query modifications”)
▪ Alter the database schema
▪ Configure the system on a per-query basis
▪ Alter the system between single-user and multi-user tests
About the Hadoop-DS Benchmark
▪ Created by IBM
▪ The Big Data Decision Support Benchmark (Hadoop-DS) is inspired by, and is highly compliant with, TPC-DS
▪ Fully complies with the TPC-DS schema requirement
▪ Uses all 99 queries
▪ Meets the multi-user requirement
▪ Has been audited by an approved TPC-DS auditor but as a non-TPC benchmark
▪ There are deviations from TPC-DS
▪ No data maintenance operations, referential integrity enforcement, or ACID property validation as these are not feasible with HDFS
▪ Additional statistics used (advanced Big SQL capability)
▪ Different metric (to avoid confusion with TPC-DS)
▪ No price and price/performance measures included
▪ Not an official TPC benchmark result
What are the key Hadoop-DS metrics?
Primary metrics:
▪ Qph Hadoop-DS@SF (Single User)
• Single-user queries per hour at a particular scale factor
▪ Qph Hadoop-DS@SF (Multi User)
• Multi-user queries per hour at a particular scale factor
Two distinct measures
▪ Power run – a single stream of queries running in sequence
▪ Throughput run – multiple streams of queries executing concurrently. A minimum of four concurrent streams is required
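As a rough illustration of the two measures, the sketch below computes a queries-per-hour figure from a run's elapsed time. The helper name `qph` and its `scale_multiplier` parameter are hypothetical. The slides do not state the formula, and the multiplier of 100 used here for 10TB is only an assumption, chosen because it reproduces the Qph-HDS@10TB figures on the performance results slide of this deck. The audited benchmark report remains the authority on the actual definition.

```python
def qph(num_queries: int, num_streams: int, elapsed_seconds: float,
        scale_multiplier: float = 1.0) -> float:
    """Queries per hour for one run (hypothetical helper, not the official formula).

    A power run has num_streams == 1. A throughput run has several streams
    (the benchmark requires at least four), each executing every query.
    """
    total_queries = num_queries * num_streams
    return scale_multiplier * total_queries * 3600 / elapsed_seconds


# Big SQL on the 46-query common subset at 10TB (elapsed times from the results slide).
# ASSUMPTION: scale_multiplier=100 is inferred from the published figures.
power = qph(46, 1, 2908, scale_multiplier=100)
throughput = qph(46, 4, 6945, scale_multiplier=100)
print(int(power), int(throughput))  # 5694 9537
```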
What did IBM Test?
30TB Hadoop-DS benchmark with Big SQL
▪ Executed and audited in as compliant a manner as possible
• Demonstrate the robustness of Big SQL at scale
• Demonstrate Big SQL’s ability to run all queries
• Demonstrate Big SQL’s multi-user concurrency capability
▪ Letter of attestation from the auditing firm and accompanying benchmark report
10TB subset Hadoop-DS benchmark with 3 vendors
▪ Compare Big SQL, Cloudera Impala and Hortonworks Hive
▪ Use the subset of queries all three vendors are able to execute
▪ Use an identical 17 node cluster for each vendor
▪ Auditor reviewed method, procedures and measurement results
Two main benchmarks were executed
Benchmark Environment
Management Node
One x3650 M4 BD
Two E5-2680 v2 2.8GHz 10-core
128GB RAM, 1866MHz
2TB 3.5” HDD
Dual-port 10GbE
RHEL 6.4
EXT4/HDFS/Parquet/ORC
Data Nodes
Seventeen x3650 M4 BD
Two E5-2680 v2 2.8GHz 10-core
128GB RAM, 1866MHz
Ten 2TB 3.5” HDD
Four 120GB 3.5” SSD
Dual-port 10GbE
RHEL 6.4
EXT4/HDFS/Parquet/ORC
Three identical clusters deployed, one for each distribution
Note: Big SQL and Impala used the Parquet file format; Hive used ORC
IBM Big SQL runs 100% of the queries
▪ IBM Big SQL runs all 99 queries, 12 with allowable minor modifications
▪ Impala runs only 52 queries – 35 out-of-the-box and 17 with allowable minor modifications
▪ Hive runs 58 queries – 32 out-of-the-box and 26 with allowable minor modifications
[Pie charts for Big SQL 3.0, Impala 1.4.1 and Hive 0.13. Legend: working directly from template; compliant query modifications; non-compliant query re-write; not working or no re-write. The Big SQL 3.0 chart shows only the first two categories]
Query compliance by SQL-on-Hadoop offering
IBM is the only vendor
able to run all 99
Hadoop-DS queries with
minor modifications
allowable under TPC-DS
benchmark rules
[Stacked bar chart: percentage of queries (0% to 100%) for Big SQL 3.0, Impala 1.4.1 and Hive 0.13, split into: working directly from template; compliant query modifications; non-compliant query re-write; not working or no re-write]
Hadoop-DS – Query Compliance Detail
                                  Small-scale test (1 GB)                 10TB-scale test
Number of queries                 Big SQL 3.0   Impala 1.4.1  Hive 0.13   Big SQL 3.0   Impala 1.4.1  Hive 0.13
Original query unchanged          87            35            32          87            31            27
Minor query modifications         12            17            26          12            11            29
Major query re-write              0             36            32          0             30            13
Percentage of queries that run    100%          89%           91%         100%          73%           70%
Not working or no re-write found  0             11            9           0             27            30
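The percentage row can be cross-checked from the counts in the table above. The following is a minimal sketch; the dictionary layout and the `summarize` helper are illustrative names, not part of the benchmark kit. It treats a query as "running" if it works unchanged, with minor modifications, or after a major re-write, which is what the published percentages imply.

```python
TOTAL_QUERIES = 99  # Hadoop-DS uses all 99 TPC-DS queries

# (unchanged, minor modifications, major re-write) per engine, copied from the table.
compliance = {
    "small-scale (1 GB)": {
        "Big SQL 3.0": (87, 12, 0),
        "Impala 1.4.1": (35, 17, 36),
        "Hive 0.13": (32, 26, 32),
    },
    "10TB": {
        "Big SQL 3.0": (87, 12, 0),
        "Impala 1.4.1": (31, 11, 30),
        "Hive 0.13": (27, 29, 13),
    },
}


def summarize(unchanged: int, minor: int, major: int, total: int = TOTAL_QUERIES):
    """Return (percentage of queries that run, number of queries not working)."""
    runnable = unchanged + minor + major
    return round(100 * runnable / total), total - runnable


for scale, engines in compliance.items():
    for engine, counts in engines.items():
        pct, failed = summarize(*counts)
        print(f"{scale:20s} {engine:14s} {pct:3d}% run, {failed:2d} not working")
```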
IBM Big SQL – Runs 100% of the queries
Key points
▪ In competitive environments, many queries needed to be re-written, some significantly
▪ Owing to various restrictions, some queries could not be re-written or failed at run-time
▪ Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another
Competitive environments require significant effort at scale
Results for 10TB scale shown here
Hadoop-DS – Performance results
                              Elapsed time (s)       Qph-HDS                Big SQL advantage
                  # Queries   Power     Throughput   Power     Throughput   Power     Throughput
46 common queries @ 10TB
IBM Big SQL 3.0   46          2,908     6,945        5,694     9,537
Impala 1.4.1      46          10,536    14,920       1,571     4,439        3.6       2.1
Hive 0.13         46          15,949    59,550       1,038     1,112        5.4       8.5
All 99 queries @ 10TB
IBM Big SQL 3.0   99          32,361    88,764       1,101     2,409
Impala 1.4.1      99          Not possible
Hive 0.13         99          Not possible
All 99 queries @ 30TB
IBM Big SQL 3.0   99          104,445   187,993      1,023     2,274
Impala 1.4.1      99          Not possible
Hive 0.13         99          Not possible
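The "Big SQL advantage" columns follow from the elapsed times for the 46-query subset. A small sketch of that arithmetic is shown here; the `advantage` helper is a hypothetical name. The published ratios appear to be truncated to one decimal place rather than rounded (for example 15,949 / 2,908 is about 5.48 and is reported as 5.4), so the sketch truncates as well; that truncation rule is an inference from the numbers, not something the slides state.

```python
import math

# Elapsed seconds on the 46 common queries at 10TB, from the table above.
elapsed = {
    "IBM Big SQL 3.0": {"power": 2908, "throughput": 6945},
    "Impala 1.4.1": {"power": 10536, "throughput": 14920},
    "Hive 0.13": {"power": 15949, "throughput": 59550},
}


def advantage(other_seconds: float, bigsql_seconds: float) -> float:
    """Elapsed-time ratio, truncated (not rounded) to one decimal place."""
    ratio = other_seconds / bigsql_seconds
    return math.floor(ratio * 10) / 10


baseline = elapsed["IBM Big SQL 3.0"]
for engine in ("Impala 1.4.1", "Hive 0.13"):
    for run in ("power", "throughput"):
        factor = advantage(elapsed[engine][run], baseline[run])
        print(f"{engine:13s} {run:10s} {factor}x")
```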
IBM Big SQL – Leading performance
Power run (single-stream) – seconds
[Bar chart, vertical axis 0 to 18,000 seconds. Elapsed time (h:mm:ss): Big SQL 48:28, Impala 2:55:36, Hive 4:25:49]
Up to 5.4x faster: 3.6x faster than Impala, 5.4x faster than Hive
As measured across the subset of 46 queries that Impala and Hive can both run
IBM Big SQL – Leading performance
Throughput run - 4 streams, average elapsed time
[Bar chart, vertical axis 0 to 70,000 seconds. Elapsed time (h:mm:ss): Big SQL 1:55:45, Impala 4:08:40, Hive 16:32:30]
Up to 8.5x faster: 2.1x faster than Impala, 8.5x faster than Hive
As measured across the subset of 46 queries that Impala and Hive can both run
30TB Hadoop-DS Results
Because other distributions could not run the 99 required
queries, it was only possible to obtain a result for Big SQL
IBM had hoped to obtain partial results @ 30TB (comparing
queries that would run across distributions)
Testing convinced us that the number of queries that
competitors could run @ 30TB was sufficiently small that a
detailed comparison would not be valid
Big SQL – Scalability and Throughput
Four concurrent query streams @30TB in 1.8x time of a single stream
[Bar chart: Elapsed times for Big SQL Hadoop-DS @30TB, single stream and 4 streams. Vertical axis: elapsed time (secs), 0 to 200,000. Power run: 99 queries. Throughput run: 396 queries]
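The 1.8x figure can be reproduced from the 30TB elapsed times reported on the performance results slide (104,445 s for the power run, 187,993 s for the throughput run). The short sketch below is only a worked calculation with illustrative variable names; it also shows the implied gain in work completed per unit of time.

```python
# 30TB elapsed times for Big SQL, taken from the performance results slide.
power_seconds = 104_445        # 1 stream x 99 queries
throughput_seconds = 187_993   # 4 streams x 99 queries = 396 queries
streams = 4

# Four times the work finished in about 1.8x the single-stream time.
time_ratio = throughput_seconds / power_seconds

# Equivalent view: queries completed per second rose by about 2.2x.
work_ratio = streams / time_ratio

print(f"time ratio: {time_ratio:.1f}x, work per unit time: {work_ratio:.2f}x")
```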
Audited Results
▪ Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
▪ InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive
▪ These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented
Conclusions
Big SQL is the only SQL-on-Hadoop engine able to run a
full Hadoop-DS workload
▪ Complete schema
▪ All 99 queries
▪ Multi-user test
▪ Ran at both 10TB and 30TB data volumes
▪ Together, these make the test a good predictor of compatibility with real applications
IBM Big SQL is the best performing solution by a large margin
▪ ~3.6 times faster than Cloudera Impala (power run, 46 common queries at 10TB)
▪ ~5.4 times faster than Hortonworks Hive (power run, 46 common queries at 10TB)
Thank you!
Additional Slides
Query throughput for Hadoop-DS @ 10TB
4 concurrent streams and 99 queries
[Bar chart: effective query throughput (Qph-HDS@10TB), axis 0 to 3,000. Big SQL: bar with labels 87 and 12. Impala: 99 queries could not be run. Hive 0.13: 99 queries could not be run]
Query throughput for Hadoop-DS @ 30TB
6 concurrent streams and 99 queries
[Bar chart: effective query throughput (Qph-HDS@30TB), axis 0 to 2,500. Big SQL: bar shown. Impala: 99 queries could not be run. Hive 0.13: 99 queries could not be run]
The Common Query Set
▪ While Big SQL ran all queries, many of the Hadoop-DS queries would not run on Impala or Hive
▪ On both platforms, some additional queries could be made to run by re-writing them (something that is not permitted in the TPC-DS benchmark specification)
▪ At 10TB scale, several queries failed at run-time
▪ This set of 46 queries is the common set that ran at 10TB scale and could thus be compared
▪ The testing team deliberately included some queries with non-compliant query modifications, where the changes were judged to be minor, in order to have a reasonable number of queries to compare
46 queries could be run on Big SQL, Impala and Hive at 10TB
Queries shown in blue are part of the common set
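Conceptually, the common query set is the intersection of the queries each engine could run at 10TB. The sketch below illustrates the idea only: the query numbers listed for Impala and Hive are hypothetical placeholders, not the actual lists, which are given in the benchmark report rather than in this transcript.

```python
# HYPOTHETICAL query IDs for illustration only; the real per-engine lists
# come from the benchmark report and are not reproduced here.
runs_on = {
    "Big SQL": set(range(1, 100)),       # all 99 queries run
    "Impala": {3, 7, 19, 27, 34, 42},    # placeholder subset
    "Hive": {3, 7, 15, 19, 34, 52},      # placeholder subset
}

# A query is comparable only if every engine can run it.
common = set.intersection(*runs_on.values())
print(sorted(common))  # [3, 7, 19, 34]
```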
About the TPC-DS queries
The queries are diverse, and many are complex
Reflecting real business needs – a random sample:
▪ Find customers returning items more frequently than normal (q1)
▪ States with customers most amenable to premium priced offers (q6)
▪ List key metrics for unadvertised in-store promotions by demographic (q7)
▪ Identify similar customers purchasing through multiple sales outlets (q10)
▪ Find customers shifting purchasing habits to the web (q11)
▪ Key measures for catalog sales fulfilled from an alternate warehouse (q16)
▪ Find frequently sold items and the circumstances under which repeat sales take place (q23)
▪ Understand the products and retail locations where items are likely to be returned and subsequently re-purchased via the catalog (q29)
▪ Display customers making significant local purchases compared to buying potential based on dependents and vehicles owned (q34)
Benchmark Environment
[Cluster diagram: X3650BD master host and X3650BD data nodes #1 through #16, connected through a 10 GbE switch on a 10 GbE private net, with IBM Blue net and Mgmt net]
Three identical clusters deployed, one for each distribution

More Related Content

What's hot

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Cynthia Saracco
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Nicolas Morales
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
Cloudera, Inc.
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
Cynthia Saracco
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
Daniel Abadi
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
Craig Jordan
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
Nicolas Morales
 
Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers  Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers
EDB
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Asis Mohanty
 
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
DataWorks Summit
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
Data Con LA
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
Jonathan Seidman
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
Daniel Abadi
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
Microsoft TechNet - Belgium and Luxembourg
 

What's hot (16)

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
 
Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers  Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 

Similar to Hadoop-DS: Which SQL-on-Hadoop Rules the Herd

Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analytics
jdijcks
 
Benchmarking Hadoop - Which hadoop sql engine leads the herd
Benchmarking Hadoop - Which hadoop sql engine leads the herdBenchmarking Hadoop - Which hadoop sql engine leads the herd
Benchmarking Hadoop - Which hadoop sql engine leads the herd
Gord Sissons
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
Edgar Alejandro Villegas
 
The Most Trusted In-Memory database in the world- Altibase
The Most Trusted In-Memory database in the world- AltibaseThe Most Trusted In-Memory database in the world- Altibase
The Most Trusted In-Memory database in the world- Altibase
Altibase
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
Precisely
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
Seeling Cheung
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Felicia Haggarty
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
DataWorks Summit
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
DataWorks Summit
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
Cloudera, Inc.
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
Inside Analysis
 
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
ModusOptimum
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
EMC
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
Skillwise Group
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
Hortonworks
 
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
epamspb
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
alanfgates
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 

Similar to Hadoop-DS: Which SQL-on-Hadoop Rules the Herd (20)

Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analytics
 
Benchmarking Hadoop - Which hadoop sql engine leads the herd
Benchmarking Hadoop - Which hadoop sql engine leads the herdBenchmarking Hadoop - Which hadoop sql engine leads the herd
Benchmarking Hadoop - Which hadoop sql engine leads the herd
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
The Most Trusted In-Memory database in the world- Altibase
The Most Trusted In-Memory database in the world- AltibaseThe Most Trusted In-Memory database in the world- Altibase
The Most Trusted In-Memory database in the world- Altibase
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the...
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
 
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "Готовое комплексное инфраструктурно...
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 

Hadoop-DS: Which SQL-on-Hadoop Rules the Herd

  • 1. © 2014 IBM Corporation Information Management Evaluating SQL-on-Hadoop Performance and Compatibility IBM Big SQL Hadoop-DS Benchmark Last revised: Oct 26, 2014
  • 2. Agenda
      - About Big SQL
      - The TPC-DS™ Benchmark
      - The Hadoop-DS Benchmark
      - Big SQL performance
          - 30TB Hadoop-DS result with IBM Big SQL
          - 10TB Hadoop-DS comparison with Cloudera Impala™ and Hortonworks® Hive
      - Conclusions
      - Additional detail
      TPC Benchmark, TPC-DS, and QphDS are trademarks of the Transaction Processing Performance Council. Cloudera, the Cloudera logo, and Cloudera Impala are trademarks of Cloudera. Hortonworks is a trademark of Hortonworks Inc. Other company, product, or service names may be trademarks or service marks of others.
  • 3. The case for SQL on Hadoop
      - SQL has become ubiquitous in today’s data center
      - Customers have large existing investments
          - Skills, commercial & in-house applications
      - 70% of Big Data initiatives involve transactional data [1]
          - Transactional big data is well suited to SQL
      - Standardization & compatibility are essential
          - Customers modernizing warehouse environments cannot afford separate SQL dialects and tools for different data sources
      [1] 70% of 465 survey respondents cite transactional data as a primary target for big data initiatives. Gartner research note “Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype”, Sept 12 2013. Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk
  • 4. IBM InfoSphere BigInsights - Big SQL
      Big SQL = Big Investment Protection
      - Rich ANSI SQL support
      - Outstanding performance
      - Native Hadoop data sources
      - Federation: multiple data sources
      - Extensive analytic functions
      - Security built-in
      [Diagram: SQL-based application; Big SQL (optimized SQL MPP run-time); native Hadoop data sources: CSV, SEQ, Parquet, RC, AVRO, ORC, JSON, Custom]
      IBM invented SQL and has over thirty years of experience engineering advanced SQL query engines
  • 5. IBM InfoSphere BigInsights - Big SQL
      Big SQL = Big Investment Protection
      - Application Portability & Integration
          - Native Hadoop data
          - Comprehensive file formats
      - Performance
          - Powerful query re-writer
          - Cost-based optimizer
          - Sophisticated memory management
      - Federation
          - Single SQL statement
          - Multiple data sources
          - DB2, Oracle, Teradata & more
      - Enterprise Features
          - Security & auditing
          - Self-tuning & management
          - Comprehensive monitoring
      - Rich SQL Support
          - ANSI compliant
          - IBM PL SQL compatible
          - Extensive analytic functions
  • 6. About the TPC-DS Benchmark – www.tpc.org
      - Models a hypothetical retail operation
      - Realistic multi-domain data warehouse environment
      - Retail sales, web, catalog data, inventory, demographics & promotions
      - Models several aspects of business operations
      - Queries, concurrency, data loading, data maintenance
      - Designed for relational data warehouse product offerings
      - Four broad types of queries
      - Reporting queries, ad-hoc queries, iterative OLAP queries, data mining queries (99 queries in all)
      - Designed for multiple scale factors
      - 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
      - Designed for multi-user concurrency
      - Minimum of 4 concurrent users running all 99 queries
      - No vendor has ever published a formal TPC-DS benchmark
  • 7. Beware of Cherry Pickers & Benchmarketing!
      TPC-DS has strict requirements:
      - All 99 queries need to be run in their entirety
      - Each query is unique and tests a different facet of the environment
      - Answer set correctness must be proven
      - Result must be audited
      As a result, it is not valid to:
      - Select individual queries
      - Change queries outside of prescribed guidelines (“minor query modifications”)
      - Alter the database schema
      - Configure the system on a per-query basis
      - Alter the system between single-user and multi-user tests
  • 8. About the Hadoop-DS Benchmark
      - Created by IBM
      - The Big Data Decision Support Benchmark (Hadoop-DS) is inspired by, and is highly compliant with, TPC-DS
      - Fully complies with the TPC-DS schema requirement
      - Uses all 99 queries
      - Meets the multi-user requirement
      - Has been audited by an approved TPC-DS auditor, but as a non-TPC benchmark
      - There are deviations from TPC-DS
      - No data maintenance operations, referential integrity enforcement, or ACID property validation, as these are not feasible with HDFS
      - Additional statistics used (advanced Big SQL capability)
      - Different metric (to avoid confusion with TPC-DS)
      - No price and price/performance measures included
      - Not an official TPC benchmark result
  • 9. What are the key Hadoop-DS metrics?
      Primary metrics:
      - Qph Hadoop-DS@SF (Single User): single-user queries per hour at a particular scaling factor
      - Qph Hadoop-DS@SF (Multi User): multi-user queries per hour at a particular scaling factor
      Two distinct measures:
      - Power run: a single stream of queries running in sequence
      - Throughput run: multiple streams of queries executing concurrently. A minimum of four concurrent streams is required
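The slides do not spell out how Qph Hadoop-DS@SF is calculated. The sketch below is only an assumption: it treats the metric as total queries per hour multiplied by a scale unit (100 for 10TB, 300 for 30TB) and truncated to an integer. The function name `qph_hadoop_ds` and the scale units are hypothetical; they were chosen because they reproduce the Qph figures reported alongside the elapsed times on the performance results slide.

```python
def qph_hadoop_ds(scale_units, queries, streams, elapsed_seconds):
    """Scaled queries per hour, truncated to an integer.

    ASSUMPTION: the formula and the scale unit (100 per 10TB) are inferred
    from the published figures, not taken from the benchmark definition.
    """
    total_queries = queries * streams  # every stream runs the full query set
    return (scale_units * total_queries * 3600) // elapsed_seconds


# Big SQL, 46-query common set at 10TB: power run (1 stream, 2,908 s)
print(qph_hadoop_ds(100, 46, 1, 2908))
# Big SQL, 46-query common set at 10TB: throughput run (4 streams, 6,945 s)
print(qph_hadoop_ds(100, 46, 4, 6945))
```

Under these assumptions the two calls give 5694 and 9537, matching the reported Big SQL values for the 46-query subset.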
  • 10. What did IBM Test?
      Two main benchmarks were executed
      30TB Hadoop-DS benchmark with Big SQL
      - Executed and audited in as compliant a manner as possible
          - Demonstrate the robustness of Big SQL at scale
          - Demonstrate Big SQL’s ability to run all queries
          - Demonstrate Big SQL’s multi-user concurrency capability
      - Letter of attestation from the auditing firm and accompanying benchmark report
      10TB subset Hadoop-DS benchmark with 3 vendors
      - Compare Big SQL, Cloudera Impala and Hortonworks Hive
      - Use the subset of queries all three vendors are able to execute
      - Use an identical 17 node cluster for each vendor
      - Auditor reviewed method, procedures and measurement results
  • 11. Benchmark Environment
      Three identical clusters deployed, one for each distribution
      Management node: one x3650 M4 BD
      - Two E5-2680 v2 2.8GHz 10-core
      - 128GB RAM, 1866MHz
      - 2TB 3.5” HDD
      - Dual-port 10GbE
      - RHEL 6.4
      - EXT4/HDFS/Parquet/ORC
      Data nodes: seventeen x3650 M4 BD
      - Two E5-2680 2.8GHz 10-core
      - 128GB RAM, 1866MHz
      - Ten 2TB 3.5” HDD
      - Four 120GB 3.5” SSD
      - Dual-port 10GbE
      - RHEL 6.4
      - EXT4/HDFS/Parquet/ORC
      Note: Big SQL and Impala used Parquet file formats. Hive used ORC
  • 12. IBM Big SQL runs 100% of the queries
      - IBM Big SQL runs all 99 queries, 12 with allowable minor modifications
      - Impala runs only 52 queries: 35 out of the box and 17 with allowable minor modifications
      - Hive runs 58 queries: 32 out of the box and 26 with allowable minor modifications
      [Charts: query breakdown for Big SQL 3.0, Impala 1.4.1 and Hive 0.13 by category: working directly from template; compliant query modifications; non-compliant query re-write; not working or no re-write]
  • 13. Query compliance by SQL-on-Hadoop offering
      IBM is the only vendor able to run all 99 Hadoop-DS queries with minor modifications allowable under TPC-DS benchmark rules
      [Chart: percentage of queries (0% to 100%) for Big SQL 3.0, Impala 1.4.1 and Hive 0.13, split into: working directly from template; compliant query modifications; non-compliant query re-write; not working or no re-write]
  • 14. Hadoop DS – Query Compliance Detail
      Columns 2 to 4 are the small-scale test (1 GB); columns 5 to 7 are the 10TB scale test.
      Number of queries | Big SQL 3.0 (1 GB) | Impala 1.4.1 (1 GB) | Hive 0.13 (1 GB) | Big SQL 3.0 (10TB) | Impala 1.4.1 (10TB) | Hive 0.13 (10TB)
      Original query unchanged | 87 | 35 | 32 | 87 | 31 | 27
      Minor query modifications | 12 | 17 | 26 | 12 | 11 | 29
      Major query re-write | 0 | 36 | 32 | 0 | 30 | 13
      Percentage of queries that run | 100% | 89% | 91% | 100% | 73% | 70%
      Not working or no re-write found | 0 | 11 | 9 | 0 | 27 | 30
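As a quick check on the compliance detail above, the "percentage of queries that run" row appears to be the share of the 99 queries that are not in the "not working or no re-write found" row, rounded to a whole percent. The sketch below recomputes it from the counts on the slide; the dictionary layout and the helper name `percent_runnable` are illustrative only.

```python
TOTAL_QUERIES = 99  # every engine was given the full Hadoop-DS query set

# "Not working or no re-write found" counts, copied from the slide.
not_working = {
    ("1 GB", "Big SQL 3.0"): 0,
    ("1 GB", "Impala 1.4.1"): 11,
    ("1 GB", "Hive 0.13"): 9,
    ("10TB", "Big SQL 3.0"): 0,
    ("10TB", "Impala 1.4.1"): 27,
    ("10TB", "Hive 0.13"): 30,
}


def percent_runnable(failed, total=TOTAL_QUERIES):
    """Share of queries that ran, as a whole-number percentage."""
    return round(100 * (total - failed) / total)


for (scale, engine), failed in not_working.items():
    print(f"{scale:5} {engine:13} {percent_runnable(failed)}%")
```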
  • 15. IBM Big SQL – Runs 100% of the queries
      Competitive environments require significant effort at scale
      Key points (results for 10TB scale shown here):
      - In competitive environments, many queries needed to be re-written, some significantly
      - Owing to various restrictions, some queries could not be re-written or failed at run-time
      - Re-writing queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another
  • 16. Hadoop-DS – Performance results
      Engine | # Queries | Elapsed time (s), Power | Elapsed time (s), Throughput | Qph-HDS@10TB, Power | Qph-HDS@10TB, Throughput | Big SQL advantage, Power | Big SQL advantage, Throughput
      IBM Big SQL 3.0 | 46 | 2,908 | 6,945 | 5,694 | 9,537 | |
      Impala 1.4.1 | 46 | 10,536 | 14,920 | 1,571 | 4,439 | 3.6 | 2.1
      Hive 0.13 | 46 | 15,949 | 59,550 | 1,038 | 1,112 | 5.4 | 8.5
      All 99 queries @ 10TB
      IBM Big SQL 3.0 | 99 | 32,361 | 88,764 | 1,101 | 2,409 | |
      Impala 1.4.1 | 99 | Not possible
      Hive 0.13 | 99 | Not possible
      All 99 queries @ 30TB
      IBM Big SQL 3.0 | 99 | 104,445 | 187,993 | 1,023 | 2,274 | |
      Impala 1.4.1 | 99 | Not possible
      Hive 0.13 | 99 | Not possible
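The "Big SQL advantage" columns can be reproduced from the elapsed times in the same table. The following sketch assumes the advantage is the competitor's elapsed time divided by Big SQL's, truncated (not rounded) to one decimal place; truncation is an inference from the published 5.4 and 8.5 figures, and the helper name `advantage` is hypothetical.

```python
import math

# Elapsed seconds for the 46-query common set at 10TB, copied from the table.
elapsed = {
    "IBM Big SQL 3.0": {"power": 2908, "throughput": 6945},
    "Impala 1.4.1": {"power": 10536, "throughput": 14920},
    "Hive 0.13": {"power": 15949, "throughput": 59550},
}


def advantage(other_seconds, big_sql_seconds):
    """Elapsed-time ratio, truncated to one decimal (assumed convention)."""
    return math.floor(10 * other_seconds / big_sql_seconds) / 10


baseline = elapsed["IBM Big SQL 3.0"]
for engine in ("Impala 1.4.1", "Hive 0.13"):
    for run in ("power", "throughput"):
        ratio = advantage(elapsed[engine][run], baseline[run])
        print(f"{engine:13} {run:10} {ratio}x")
```

With ordinary rounding the Hive ratios would come out as 5.5 and 8.6, so the slide's figures are slightly conservative under this reading.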
  • 17. IBM Big SQL – Leading performance
      Up to 5.4x faster: 3.6x faster than Impala, 5.4x faster than Hive
      [Chart: Power run (single-stream) elapsed time in seconds, as measured across the subset of 46 queries that Impala and Hive can both run. Big SQL 48:28, Impala 2:55:36, Hive 4:25:49]
  • 18. IBM Big SQL – Leading performance
      Up to 8.5x faster: 2.1x faster than Impala, 8.5x faster than Hive
      [Chart: Throughput run (4 streams) average elapsed time in seconds, as measured across the subset of 46 queries that Impala and Hive can both run. Big SQL 1:55:45, Impala 4:08:40, Hive 16:32:30]
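The clock-style labels on the two performance charts are the elapsed seconds from the results table expressed as hours, minutes and seconds. A small conversion sketch follows; the helper name `hms` is illustrative.

```python
def hms(seconds):
    """Format elapsed seconds as h:mm:ss, or mm:ss when under an hour."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


# Power run, 46-query common set at 10TB
for engine, seconds in [("Big SQL", 2908), ("Impala", 10536), ("Hive", 15949)]:
    print(engine, hms(seconds))

# Throughput run (4 streams), same query set
for engine, seconds in [("Big SQL", 6945), ("Impala", 14920), ("Hive", 59550)]:
    print(engine, hms(seconds))
```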
  • 19. 30TB Hadoop-DS Results
      - Because other distributions could not run the 99 required queries, it was only possible to obtain a result for Big SQL
      - IBM had hoped to obtain partial results @ 30TB (comparing queries that would run across distributions)
      - Testing convinced us that the number of queries that competitors could run @ 30TB was sufficiently small that a detailed comparison would not be valid
  • 20. Big SQL – Scalability and Throughput
      Four concurrent query streams @30TB in 1.8x time of a single stream
      [Chart: Elapsed times (secs) for Big SQL Hadoop-DS @30TB, single stream and 4 streams. Power run: 99 queries; Throughput run: 396 queries]
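The 1.8x figure follows from the 30TB elapsed times reported for Big SQL (104,445 s for the power run, 187,993 s for the four-stream throughput run). The sketch below recomputes it and also derives the implied gain in queries completed per unit time; that second number is a derived illustration, not a figure stated on the slides.

```python
POWER_SECONDS = 104_445       # 1 stream, 99 queries at 30TB
THROUGHPUT_SECONDS = 187_993  # 4 streams, 396 queries at 30TB
STREAMS = 4

# How much longer the four-stream run took than the single-stream run.
time_ratio = THROUGHPUT_SECONDS / POWER_SECONDS

# Four times the work in time_ratio times the time.
throughput_gain = STREAMS / time_ratio

print(f"elapsed time ratio: {time_ratio:.1f}x")
print(f"queries per unit time vs single stream: {throughput_gain:.2f}x")
```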
  • 21. Audited Results
      - Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
      - InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive
      - These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented
  • 22. Conclusions
      Big SQL is the only SQL-on-Hadoop engine able to run a full Hadoop-DS workload
      - Complete schema
      - All 99 queries
      - Multi-user test
      - Ran at both 10TB and 30TB data volumes
      - Together, these make the test a good predictor of compatibility with real applications
      IBM Big SQL is the best performing solution by a large margin (single-stream power run, 46-query common set at 10TB)
      - ~3.6 times faster than Cloudera Impala
      - ~5.4 times faster than Hortonworks Hive
  • 23. Thank you!
  • 24. Additional Slides
  • 25. Query throughput for Hadoop-DS @ 10TB
      4 concurrent streams and 99 queries
      [Chart: Effective query throughput (Qph-HDS@10TB) for Big SQL, Impala and Hive .13. Big SQL bar labeled 87 and 12. Impala: 99 queries could not be run. Hive: 99 queries could not be run]
  • 26. Query throughput for Hadoop-DS @ 30TB
      6 concurrent streams and 99 queries
      [Chart: Effective query throughput (Qph-HDS@30TB) for Big SQL, Impala and Hive .13. Impala: 99 queries could not be run. Hive: 99 queries could not be run]
  • 27. The Common Query Set
      46 queries could be run on Big SQL, Impala and Hive at 10TB
      - While Big SQL ran all queries, many of the Hadoop-DS queries would not run on Impala or Hive
      - On both platforms, some additional queries could be made to run by re-writing the queries (something that is not permitted in the TPC-DS benchmark specification)
      - At 10TB scale, several queries failed at run-time
      - This set of 46 queries is the common set that ran at 10TB scale and could thus be compared
      - The testing team deliberately included some queries with non-compliant query modifications, where the changes were judged to be minor, in order to have a reasonable number of queries to compare
      Queries shown in blue are part of the common set
  • 28. About the TPC-DS queries
      The queries are diverse, and many are complex
      Reflecting real business needs, a random sample:
      - Find customers returning items more frequently than normal (q1)
      - States with customers most amenable to premium priced offers (q6)
      - List key metrics for unadvertised in-store promotions by demographic (q7)
      - Identify similar customers purchasing through multiple sales outlets (q10)
      - Find customers shifting purchasing habits to the web (q11)
      - Key measures for catalog sales fulfilled from an alternate warehouse (q16)
      - Find frequently sold items and the circumstances under which repeat sales take place (q23)
      - Understand the products and retail locations where items are likely to be returned and subsequently re-purchased via the catalog (q29)
      - Display customers making significant local purchases compared to buying potential based on dependents and vehicles owned (q34)
  • 29. Benchmark Environment
      Three identical clusters deployed, one for each distribution
      [Diagram: X3650BD master host and X3650BD data nodes #1 through #16, connected through a 10 GbE switch; networks shown: 10 GbE private net, IBM Blue net, Mgmt net]