© 2017 IBM Corporation, Sept 2017
Breaching the 100TB mark with
SQL over Hadoop
Analytics Performance
Big SQL v5.0.1 Highlights
• Big SQL's leadership in performance and scalability
• New partitioning options
• Memory and caching improvements
• Enhancements with the ORC file format
Simon Harris (siharris@au1.ibm.com)
IBM Research
Priya Tiruthani (ntiruth@us.ibm.com)
Big SQL Offering Manager
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which
IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for
informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant.
While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without
warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of
IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have
achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to,
nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2017. All rights reserved.
— U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in
the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on
the Web at
“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
TPC Benchmark, TPC-DS, and QphDS are trademarks of the Transaction Processing Performance Council.
Cloudera, the Cloudera logo, and Cloudera Impala are trademarks of Cloudera.
Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other
countries.
Other company, product, or service names may be trademarks or service marks of others.
What is IBM Big SQL?
Big SQL is the only SQL-on-Hadoop solution that understands SQL syntax from other vendors and products, including Oracle, IBM Db2 and Netezza. For this reason, Big SQL is the ultimate hybrid engine for optimizing EDW workloads on an open Hadoop platform.
Do you have any of these challenges?
• Want to modernize your EDW without long and costly migration efforts
• Offloading historical data from Oracle, Db2 or Netezza because you are reaching capacity
• Operationalizing machine learning
• Need to query, optimize and integrate multiple data sources from one single endpoint
• Slow query performance for SQL workloads
• Require the skill set to migrate data from an RDBMS to Hadoop / Hive
Here’s How Big SQL Addresses These Challenges
• Compatible with Oracle, Db2 & Netezza SQL syntax
  • Modernizing EDW workloads on Hadoop has never been easier
  • Application portability (e.g. Cognos, Tableau, MicroStrategy, ...)
• Federates all your data behind a single SQL engine
  • Query Hive, Spark and HBase data from a single endpoint
  • Federate your Hadoop data using connectors to Teradata, Oracle, Db2 & more
  • Query data sources that have Spark connectors
  • Addresses the skill set gap involved in migrating technologies
• Delivers high performance & concurrency for BI workloads
  • Unlock Hadoop data with the analytics tools of your choice
• Provides greater security while accessing data
  • Robust role-based access control and Ranger integration
• Operationalizes machine learning through integration with Spark
  • Bi-directional integration with Spark exploits Spark's connectors as well as its ML capabilities
IBM’s Big SQL Preserves the Open Source Foundation
Big SQL leverages the Hive metastore and storage formats. No lock-in: data is part of Hadoop, not Big SQL.
[Architecture: applications sit on top of the SQL execution engines (Big SQL from IBM alongside open source Hive), which share the open source Hive storage model (CSV, tab-delimited, Parquet, ORC, others) and the open source Hive metastore.]
Data Virtualization
Big SQL queries heterogeneous systems in a single query. It is the only SQL-on-Hadoop engine that virtualizes more than 10 different data sources: RDBMS, NoSQL, HDFS or Object Store.
Through Fluid Query (federation), Big SQL on the Hortonworks Data Platform (HDP) reaches Oracle, Microsoft SQL Server, Teradata, Db2, Netezza (PDA), Informix, Hive, HBase, HDFS, WebHDFS and Object Store (S3).
Big SQL enables query federation by virtualizing data sources and processing data where it resides.
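As an illustration of how this federation is wired up, here is a minimal sketch using the Db2-style federation DDL that Big SQL inherits. The server, user mapping and nickname names, the Oracle node alias and the joined tables are all hypothetical, and the exact wrapper and server options depend on the source being registered:

```sql
-- Register a remote Oracle server behind the NET8 wrapper (hypothetical names).
CREATE SERVER ora_dw TYPE ORACLE VERSION 12 WRAPPER NET8
    OPTIONS (NODE 'ora_tns_alias');

-- Map the local Big SQL user to credentials on the remote source.
CREATE USER MAPPING FOR bigsql SERVER ora_dw
    OPTIONS (REMOTE_AUTHID 'dw_user', REMOTE_PASSWORD '********');

-- Expose a remote table as a local nickname.
CREATE NICKNAME ora_sales FOR ora_dw.DW.SALES;

-- Join the federated Oracle data with a local Hive table in one statement;
-- eligible predicates and joins are pushed down to the remote source.
SELECT d.store_id, SUM(s.amount) AS total_amount
FROM   store_dim d                -- local Hadoop (Hive) table
JOIN   ora_sales s ON s.store_id = d.store_id
GROUP  BY d.store_id;
```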
Data Offloading and Analytics
• Easy porting of enterprise applications
• Works seamlessly with Business Intelligence tools like Cognos to gain insights
• Big SQL integrates with the Information Governance Catalog by enabling easy shared imports to InfoSphere Metadata Asset Manager, which allows you to:
  • Analyze assets
  • Utilize assets in jobs
  • Designate stewards for the assets
[Diagram: Oracle SQL, Db2 SQL and Netezza SQL all flow into Big SQL (SQL syntax tolerance, ANSI SQL compliant), which feeds Cognos Analytics and InfoSphere Metadata Asset Manager.]
Big SQL is a synergistic SQL engine that offers SQL compatibility, portability and the collaborative ability to perform composite analysis on data.
Data Security
Big SQL offers row- and column-level access control (RCAC) among other security settings.
[Diagram: role-based access control enables separation of duties and audit, with a FINANCE role acting as security admin; row-level security restricts BRANCH_A and BRANCH_B users to their own rows.]
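To make the branch scenario concrete, below is a hedged sketch using the Db2-style row permission and column mask DDL that Big SQL supports; the staff.accounts table, the role names and the masking rule are hypothetical:

```sql
-- Roles for a branch and the FINANCE security admins (hypothetical).
CREATE ROLE branch_a;
CREATE ROLE finance;

-- Row permission: BRANCH_A users see only branch 'A' rows; FINANCE sees all.
CREATE PERMISSION branch_row_access ON staff.accounts
    FOR ROWS WHERE (VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A') = 1
                    AND branch = 'A')
                OR VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
    ENFORCED FOR ALL ACCESS
    ENABLE;

-- Column mask: only FINANCE sees full account numbers.
CREATE MASK acct_mask ON staff.accounts FOR COLUMN account_no
    RETURN CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
                THEN account_no
                ELSE 'XXX-' || SUBSTR(account_no, 8, 4)
           END
    ENABLE;

-- The controls only take effect once activated on the table.
ALTER TABLE staff.accounts
    ACTIVATE ROW ACCESS CONTROL
    ACTIVATE COLUMN ACCESS CONTROL;
```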
Themes
• Integration: bringing together the different components of the Hadoop ecosystem and making sure Big SQL offers enhanced capabilities and a smooth experience.
• Performance: execution of queries, simple or complex, needs to complete with low latency. Big SQL continues to focus on improving query execution for all open source file formats.
• Usability & Serviceability: by simplifying the complexity of setup and troubleshooting that comes with the Hadoop ecosystem, users benefit from increased productivity in their use cases.
• Enterprise, Governance & Security: enterprise needs center on application portability and data security. Big SQL has high application portability and continues to enhance it, and also focuses on centralized governance and auditing.
Integration
Big SQL v5.0.x focuses on providing integration with the following:
• Bi-directional Spark integration: run Spark jobs from Big SQL (see the sketch below)
• Ranger integration: provides centralized security
• YARN integration: easy flex up/down of Big SQL workers
• HDP integration: integration with HDP v2.6.x
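As a sketch of the bi-directional Spark integration, the query below invokes a Spark job from Big SQL through the SYSHADOOP.EXECSPARK table function and consumes its result as a table; the job class and input path are hypothetical placeholders:

```sql
-- Run a Spark job from SQL and treat its output as a table
-- (com.example.ReadJsonFile and the HDFS path are illustrative).
SELECT t.country, COUNT(*) AS visits
FROM TABLE(SYSHADOOP.EXECSPARK(
         language => 'scala',
         class    => 'com.example.ReadJsonFile',
         uri      => 'hdfs:///user/bigsql/clickstream.json'
     )) AS t
GROUP BY t.country;
```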
Performance
Constant performance upgrades help Big SQL lead in performance for complex queries:
• Elastic Boost technology: launch multiple Big SQL workers on a node for SQL execution
• ORC enhancements: handling of the ORC file format shows a marked increase in performance (on par with the Parquet file format)
• Performance benchmarks: a new benchmark shows great performance
Enterprise, Governance & Security
• Oracle compatibility for application portability (see the sketch below)
• Ranger integration provides centralized security
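As a hedged illustration of that syntax tolerance, Oracle-dialect scalar functions such as NVL, DECODE, TO_CHAR and TO_DATE run unchanged on Big SQL; the orders table and its columns here are hypothetical:

```sql
-- Oracle-style SQL executing as-is on Big SQL (illustrative schema).
SELECT cust_id,
       NVL(region, 'UNKNOWN')              AS region,
       DECODE(status, 'A', 'active',
                      'C', 'closed',
                           'other')        AS status_text,
       TO_CHAR(order_date, 'YYYY-MM-DD')   AS order_day
FROM   orders
WHERE  order_date >= TO_DATE('2017-01-01', 'YYYY-MM-DD');
```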
Big SQL Sandbox
A single-node sandbox to visualize data using Zeppelin:
• UI-driven simple install with a few clicks
• Data loaded for immediate use
• Guided tutorial using Zeppelin
Right Tool for the Right Job
Not mutually exclusive: Hive, Big SQL & Spark SQL can co-exist and complement each other in a cluster.
• Big SQL: federation, complex queries, high concurrency, enterprise ready, application portability, all open source file formats. Ideal tool for BI data analysts and production workloads.
• Spark SQL: machine learning, data exploration, simpler SQL. Ideal tool for data scientists and discovery.
• Hive: in-memory cache, geospatial analytics, ACID capabilities, fast ingest. Ideal tool for BI data analysts and production workloads.
To Summarize - Core Themes of Big SQL
• SQL Compatibility: understands different SQL dialects; reuse skills and applications with little or no change
• Federation: connect to remote data sources with query pushdown; Spark connectors reach more data sources & ML models
• Performance: executes all 99 TPC-DS queries; scales linearly with increased concurrency
• Enterprise & Security: automatic memory management; role- and column-based data security
Big SQL is the only SQL-on-Hadoop engine that...
• Is SQL compatible with Oracle, Db2 and Netezza: applications work as-is without any changes
• Federates to more than 10 data sources: RDBMS, NoSQL and/or Object Stores
• Integrates bi-directionally with Spark, like no other, and operationalizes ML models
• Exhibits high performance even when data scales up to 100TB with complex SQL
• Handles many concurrent users without relinquishing performance
• Secures data using SQL with roles, and integrates with Ranger for centralized management
Hadoop-DS @ 100TB
Breaching the 100TB mark: The Environment
[Overview: the F1 cluster and the three phases measured: Load, Single Stream and 4-Streams]
About the Hadoop-DS Workload
Aim: to provide the fairest and most meaningful comparison of SQL-over-Hadoop solutions.
The Hadoop-DS benchmark is based on the TPC-DS benchmark and strives to follow the latest (v2.3) TPC-DS specification wherever possible. Key deviations:
• No data maintenance or data persistence phases - not possible across all vendors
• Uses the subset of queries that all solutions can successfully execute at that scale factor
• Queries are not cherry-picked
It is still the most complete TPC-DS-like benchmark executed so far, and includes the database build, a single-stream run and a multi-stream run. First published in Oct 2014, using Big SQL, Impala and Hive. This publication compares Big SQL v5.0 with Spark 2.1 and focuses on the 4-stream run.
What is TPC-DS?
• TPC = Transaction Processing Performance Council, a non-profit (vendor-independent) corporation that defines various industry-driven database benchmarks; DS = Decision Support
• Models a multi-domain data warehouse environment for a hypothetical retailer: retail sales, web sales, inventory, demographics and promotions
• Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
• 99 pre-defined queries in four classes: 1. Reporting, 2. Iterative OLAP, 3. Data Mining, 4. Ad Hoc
TPC-DS is now at Version 2.5 (http://www.tpc.org/tpcds/default.asp)
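For a flavor of the workload, here is query 3, one of the 99 pre-defined queries and a typical reporting-class query; the manufacturer id and month are substitution parameters chosen by the specification's query generator:

```sql
-- TPC-DS query 3: brand revenue for a given manufacturer in a given month.
SELECT dt.d_year,
       item.i_brand_id          AS brand_id,
       item.i_brand             AS brand,
       SUM(ss_ext_sales_price)  AS sum_agg
FROM   date_dim dt,
       store_sales,
       item
WHERE  dt.d_date_sk = store_sales.ss_sold_date_sk
  AND  store_sales.ss_item_sk = item.i_item_sk
  AND  item.i_manufact_id = 128      -- substitution parameter
  AND  dt.d_moy = 11                 -- substitution parameter
GROUP  BY dt.d_year, item.i_brand_id, item.i_brand
ORDER  BY dt.d_year, sum_agg DESC, brand_id
FETCH FIRST 100 ROWS ONLY;
```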
SO WHAT DOES IT TAKE TO BREACH 100TB?
F1 CLUSTER: DESIGNED FOR SPARK
• Hardware: 28 Lenovo x3650 M5 nodes; 100GbE Mellanox SN2700 switch; 448TB of SSD PCIe Intel NVMe storage; 1,008 Intel Broadwell cores; 42,000GB RAM
• Software: IOP 4.2; RHEL 7.2; Spark 2.1; Big SQL 5.0
• Cluster bandwidth: 375GB/s network; 480GB/s disk IO (all 4 seasons of House of Cards plus Orange is the New Black loaded into RAM in 1 sec)
• Data prep: 7 hours to generate 100TB of raw data; 39 hours to partition and load (Parquet); 39.7TB on-HDFS size for the Parquet files
• Compression: 60% space saved with Parquet
• Energy use per node: 167 watts at stand-by; 560 watts at peak load; 475 watts load average
• Peak CPU usage: 96% total; 73.7% user; 22% sys; <1% iowait
HADOOP-DS @ 100TB: TUNING THE STACK
1. Low-level machine tuning: file system optimization & mounts; network tuning; disable CPU scaling
2. HDFS tuning: 5 properties, mainly for the NameNode
3. MR2 tuning (for data load): 7 properties covering map/reduce memory, Java heap size and io.sort.factor
4. Parquet <-> HDFS block alignment to reduce unnecessary IO ops: block size = 128MB
5. HDFS rebalance: balance data across nodes to reduce uneven load
6. YARN tuning: 10 properties, mainly for container allocation
7. Spark SQL tuning: 11 properties plus 3 spark-submit properties
8. Big SQL tuning (with Elastic Boost): 11 properties (5 are now used as the defaults)
Full details are in the backup slides.
[Pipeline: basic CPU, IO & network throughput tests -> generate data -> load data -> re-load data -> rebalance -> Spark SQL queries -> Big SQL queries]
100TB Hadoop-DS is BIGdata
100TB Database Build
Parquet (with compression) was chosen as the storage format for Big SQL and Spark SQL. Fact tables were partitioned (compliant) to take advantage of the new 'partition elimination thru join keys' available in Big SQL v5. Both Big SQL and Spark SQL used exactly the same partitioned Parquet files; Spark SQL did not require the Analyze & Stats View build stages.
The Load stage took ~39 hours:
• The STORE_SALES table is heavily skewed on the null partition (SS_SOLD_DATE_SK=NULL)
• Most of the time was spent loading this null partition (~570GB, where other partitions are ~20GB); in LOAD this is done by a single reducer
• INSERT..SELECT using multiple logical Big SQL workers may be faster (we ran out of time before we could try it)
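A hedged sketch of the kind of DDL and parallel populate described above; the column list is abbreviated, and the ext_store_sales external staging table over the raw text files is hypothetical:

```sql
-- Partitioned Parquet fact table (abbreviated; the real STORE_SALES has 23 columns).
-- Partitioning by ss_sold_date_sk enables partition elimination thru join keys.
CREATE HADOOP TABLE store_sales (
    ss_sold_time_sk     INT,
    ss_item_sk          BIGINT NOT NULL,
    ss_customer_sk      BIGINT,
    ss_quantity         INT,
    ss_ext_sales_price  DECIMAL(7,2)
)
PARTITIONED BY (ss_sold_date_sk BIGINT)
STORED AS PARQUET;

-- INSERT..SELECT spreads the write across all logical Big SQL workers,
-- avoiding the single-reducer bottleneck that LOAD hits on the skewed
-- NULL partition. The partitioning column goes last in the select list.
INSERT INTO store_sales
SELECT ss_sold_time_sk, ss_item_sk, ss_customer_sk,
       ss_quantity, ss_ext_sales_price, ss_sold_date_sk
FROM   ext_store_sales;   -- hypothetical staging table over the raw data
```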
Query compliance through the scale factors
• Spark SQL has made impressive strides since v1.6 and can run all 99 TPC-DS compliant queries out of the box, but only at the lower scale factors
• At 100TB, 16 of the 99 queries fail with runtime errors or time out (> 10 hours)
• Big SQL has been successfully executing all 99 queries since Oct 2014
• IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB, with compliant DDL and query syntax
For an apples-to-apples comparison, the 83 queries which Spark SQL could successfully execute were used.
Performance: Single Stream Run
• A single-stream run represents a power run
• An interesting engineering exercise, but not representative of real-life usage
Big SQL is 3.8x faster than Spark 2.1 for the single-stream run.
Total elapsed time for the Hadoop-DS workload @ 100TB, single stream (shorter is better):
• Big SQL v5.0, 99 queries: 27,859 secs
• Big SQL v5.0, 83 queries: 18,145 secs
• Spark SQL v2.1, 83 queries: 68,735 secs
Performance: 4 Concurrent Streams
• Multi-stream query execution most closely represents real-life usage
• The analysis focuses on the 4-stream runs
Big SQL is 3.2x faster than Spark 2.1 for 4 concurrent streams.
Total elapsed time for the Hadoop-DS workload @ 100TB, 4 concurrent streams (shorter is better):
• Big SQL v5.0, 99 queries x 4 streams = 396 queries: 81,329 secs
• Big SQL v5.0, 83 queries x 4 streams = 332 queries: 49,217 secs
• Spark SQL v2.1, 83 queries x 4 streams = 332 queries: 155,515 secs
CPU Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
• Average CPU consumption across the 4-stream run was 76.4% for Big SQL compared to 88.2% for Spark SQL
• Spark SQL uses almost 3x more system CPU; these are wasted CPU cycles
• There is very little IO wait time for either engine (the SSDs are fast)
• Since the Spark SQL run is 3.2x longer than Big SQL's, Spark SQL actually consumes more than 3x the CPU resources to complete the same workload
• Big SQL: some nodes have higher CPU consumption than others, showing imbalance in the distribution of work amongst the nodes
• Spark SQL: CPU is evenly distributed across the nodes, indicating work is more evenly distributed
• Big SQL is much more efficient in how it uses the available CPU resources
Big SQL vs Spark SQL Memory Consumption
Hadoop-DS @ 100TB, 4 Concurrent Streams
• Big SQL is only 'actively' using approximately one third of the available memory
• This indicates more memory could be assigned to bufferpools, sort space, etc., so Big SQL could be even faster and/or support greater concurrency
• Spark SQL does a better job of utilizing the available memory, but consequently has less room for improvement via tuning
[Charts: per-node active / inactive / free memory for Big SQL and Spark SQL]
I/O Activity: 4 Streams
• Spark SQL performs more I/O than Big SQL
• Since the Spark SQL run lasts 3.2x longer than Big SQL's, Spark SQL is actually reading ~12x more data than Big SQL and writing ~30x more data
• This indicates Spark SQL needs to do more I/O to complete the workload; yet when high I/O throughput is required, Big SQL can drive the SSDs harder than Spark SQL
• This points to greater efficiency within the mature Big SQL optimizer & execution engine
HADOOP-DS @ 100TB: SUMMARY
• Performance: Big SQL 3.2x faster across 4 concurrent query streams
• Working queries: Big SQL ran all 99; Spark SQL ran 83
• CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU 76.4%
• I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput per node: read 4.4GB/sec, write 2.8GB/sec
• Compression: 60% space saved with Parquet
Lessons Learned: General
• Building a brand-new cluster from the ground up is tough!
• Full-stack tuning was required to get the cluster to a state capable of handling 100TB
• Pay close attention to how the data is loaded, and think carefully about the partitioning scheme
• Be cognizant of data skew, especially in your partitioning scheme
• Concurrency is much more difficult than a single stream
• Complex queries pose a significant problem for most SQL-over-Hadoop solutions at scale
• Near-best performance is often achieved in the first 5-8 runs; the absolute best may take much longer (the 80:20 rule)
Lessons Learned: Spark SQL
• Spark SQL has come a long way, very quickly. BUT...
• Success at lower scale factors does not guarantee success at higher scale factors
• Significant effort is required to tune failing queries
• Spark SQL still relies heavily (almost entirely) on manual tuning
• To get the best out of Spark SQL, the level of parallelism (num-executors) and the memory assigned to them (executor-memory) need to be tuned for each query, and the optimal values vary depending on how many other Spark queries are running in the cluster at that particular time
• This is very difficult, if not impossible, to manage in a production environment
Lessons Learned: Big SQL
• 4 query failures in early runs using product defaults:
  • Quickly eliminated via product tuning
  • Big SQL defaults were changed as a result
• Focused on hardening the "Elastic Boost" capability to gain maximum throughput:
  • Extensive development work in the Scheduler to evenly distribute work amongst the logical workers
• Spare capacity (memory, CPU) could be better utilized; we could have done better!
• Big SQL has unique tuning features to help with stubborn queries:
  • Only a limited set of these is allowed by the Hadoop-DS rules, but they could be deployed in production clusters
ORC Performance Evaluation
Hadoop-DS @ 10TB: Big SQL v5.0.1 vs Hive 2.1 (LLAP on Tez), covering the load, single-stream and 6-stream phases.
Hardware: 17 Lenovo x3650 M4 nodes; 640 logical cores; 2,048GB RAM; 288TB disk space; 10Gb Ethernet.
HADOOP-DS @ 10TB: BIG SQL V5.0.1 AND HIVE 2.1 (LLAP WITH TEZ)
At a glance, over the 85 common queries:
• Performance, 6 streams: Big SQL 2.3x faster
• Performance, 1 stream: Big SQL 1.8x faster
• Workload: scale factor 10TB; file format ORC (zlib); concurrency 6 streams; query subset 85 queries
• Resource utilization, 6 streams: Big SQL used 1.5x fewer CPU cycles
• Stack: HDP 2.6.1; Big SQL 5.0.1; Hive 2.1 LLAP on Tez
Interesting facts:
• Fastest query: 5.4x faster (Big SQL: 1.5 sec, Hive: 8.1 sec)
• Slowest query (query 67): 1.7x faster (Big SQL: 6,827 sec, Hive: 11,830 sec)
• Big SQL was faster for 80% of the queries run
WHY?
• Advanced autonomics: self-tuning memory manager; integrated workload manager
• World-class cost-based optimizer: query rewrite; advanced statistics
• Advanced partitioning
• Native row & columnar stores
• Hardened runtime
• Elastic Boost
• SQL compatibility
So, what does all this boil down to?
• Data scientists and business analysts can be 3-4 times more productive using Big SQL compared to Spark SQL
• With Big SQL, users can focus on what they want to do, not on how it is executed
Proof points:
• Successfully ran all 99 TPC-DS queries @ 100TB in 4 concurrent streams
• Performance leadership
• Uses fewer cluster resources
• Simpler configuration with mature self-tuning and workload management features
• Big SQL is the best SQL-over-Hadoop engine for complex analytical workloads
• No one else has published @ 100TB (or anywhere close)
Questions?
https://developer.ibm.com/hadoop/category/bigsql/
Thank you!
Backup slides
SQL over Hadoop Use Cases
But SQL can do so much more... the "best tool for the job" paradigm needs to be balanced against maintainability and support costs.
[Diagram mapping use cases to engines:
• Federation: Big SQL
• Ad hoc data preparation for analytics: Hive, Spark SQL
• Transactional with fast lookups, fewer users (Operational Data Store): Big SQL + HBase, Big SQL + Phoenix
• Ad hoc queries and discovery: Spark SQL, Hive
• ELT and simple, large-scale queries: Hive, Spark SQL
• Complex SQL, many users, deep analytics: Big SQL]
Big SQL v5 + YARN Integration
Dynamic allocation / release of resources:
• The Big SQL Head includes a Slider client; the Big SQL Slider package implements the Slider Client APIs
• The YARN Resource Manager & Scheduler launches a Big SQL Application Master (AM), which runs Big SQL workers in YARN containers on the NodeManagers (NM), over HDFS
• Stopped workers release their memory to YARN for other jobs
Big SQL v5 Elastic Boost: Multiple Workers per Host
More granular elasticity: the same YARN / Slider / Big SQL component stack as the previous slide, but each NodeManager now hosts multiple Big SQL worker containers, so capacity can be flexed up or down worker-by-worker rather than node-by-node.
Cluster Details (F1 Spark Cluster)
Designed for Spark.
Totals across all cluster data nodes:
• 1,080 cores, 2,160 threads
• 45TB memory
• 100TB database storage replicated 3x plus temp; 480TB raw; 240 NVMe drives
Hardware details (a 100TB SQL database requires a 28-node cluster):
• 2 management nodes (Lenovo x3650 M5), co-located with data nodes
• 28 data nodes (Lenovo x3650 M5)
• 2 racks, 20x 2U servers per rack (42U racks)
• 1 switch: 100GbE, 32 ports, 1U (Mellanox SN2700)
Each data node:
• CPU: 2x E5-2697 v4 @ 2.3GHz (Broadwell, 18 cores); Passmark: 23,054 / 46,108
• Memory: 1.5TB per server (24x 64GB DDR4 2400MHz)
• Flash storage: 8x 2TB SSD PCIe NVMe (Intel DC P3700), 16TB per server
• Network: 100GbE adapter (Mellanox ConnectX-5 EN)
• IO bandwidth: 16GB/s per server, 480GB/s per cluster; network bandwidth: 12.5GB/s per server, 375GB/s per cluster
Query compliance through the scale factors (cont.)
Almost half (7) of the 16 Spark SQL 2.1 queries which fail at 100TB can be classified as complex in nature:
• No surprise, since Spark is a relatively immature technology
• In line with findings from the original Hadoop-DS work in 2014
Big SQL's RDBMS heritage is the key to providing enterprise-grade SQL for complex analytical workloads. Any query which does not complete, or which requires modification or tuning, impacts business productivity and wastes valuable human & machine resources.
Hadoop Tuning
HDFS settings (default -> 100TB):
• NameNode Java heap: 4G -> 20G
• NameNode new generation size: 512M -> 2.5G
• NameNode maximum new generation size: 512M -> 2.5G
• Hadoop maximum Java heap size: 4G -> 10G
• DataNode max data transfer threads (helps HDFS data rebalance): 4096 -> 16384
MR2 settings, applicable to load operations (default -> 100TB):
• MapReduce Framework map memory: 2G -> 35G
• MapReduce Framework reduce memory: 4G -> 69G
• MapReduce sort allocation memory (helps with HDFS rebalancing): 1G -> 2G
• MR map Java heap size (MB): 1638 -> 28262
• MR reduce Java heap size (MB): 7840 -> 56524
• mapreduce.jobhistory.intermediate-done-dir: /var -> /data15/var
• mapreduce.task.io.sort.factor: 100 -> 1000
YARN settings (default -> 100TB):
• Percentage of physical CPU allocated for all containers: 80% -> 90%
• Number of virtual cores: 57 (80%) -> 72
• Minimum container size: 44G -> 20G
• ResourceManager Java heap size: 1G -> 8G
• NodeManager Java heap size: 1G -> 8G
• AppTimelineServer Java heap size: 1G -> 8G
• YARN Java heap size: 1G -> 2G
• Advanced (fault tolerance) yarn.resourcemanager.connect.retry-interval.ms: 30000 -> 250
• Advanced yarn-site hadoop.registry.rm.enabled: false -> true
• Advanced yarn-site yarn.client.nodemanager-connect.retry-interval-ms: 10000 -> 250
Spark Tuning (default -> 10TB -> 100TB):
• spark.rpc.askTimeout (s): 120 -> 1200 -> 36000
• spark.kryoserializer.buffer.max (mb): 64 -> 768 -> 768
• spark.yarn.executor.memoryOverhead (mb): 384 -> 1384 -> 8192
• spark.driver.maxResultSize: 1G -> 8G -> 40G
• spark.local.dir: /tmp -> /data[1-10]/tmp -> /data[1-10]/tmp
• spark.network.timeout: 120 -> 1200 -> 36000
• spark.sql.broadcastTimeout: 120 -> 1600 -> 36000
• spark.buffer.pageSize: computed -> computed -> 64m
• spark.shuffle.file.buffer: computed -> computed -> 512k
• spark.memory.fraction: 0.6 -> 0.8 -> 0.8
• spark.scheduler.listenerbus.eventqueue.size: 10K -> 120K -> 600K
Big SQL Tuning
Settings (default -> 100TB); several of these 100TB values have since become the defaults in v5.0:
• Big SQL workers per node: 1 -> 12
• INSTANCE_MEMORY: 25% -> 97%
• DB2_CPU_BINDING: 25% -> MACHINE_SHARE=94
• DB2_EXT_TABLE_SETTINGS: DFSIO_MEM_RESERVE:20 -> DFSIO_MEM_RESERVE:0
• DFT_DEGREE: 8 -> 4
• SORTHEAP: computed -> 4.4G
• SHEAPTHRES_SHR: computed -> 70G
• Bufferpool size: computed -> 15G
• scheduler.cache.splits: false -> true
• scheduler.assignment.algorithm: GREEDY -> MLN_RANDOMIZED
• scheduler.dataLocationCount: computed -> max:28
• scheduler.maxWorkerThreads: computed -> 8192
What About Other SQL-over-Hadoop TPC-DS Benchmarks?
• Cloudera, Sept 2016: Impala 2.6 on AWS; claims 42% more performant than AWS Redshift; 70-query subset; 3TB
• Cloudera, August 2016: Impala 2.6; claims 22% faster for TPC-DS than the previous version; 17 queries referenced; data volume not specified
• Cloudera, April 2016: Impala 2.5; claims 4.3x faster for TPC-DS than the previous version; 24-query subset; 15TB *2
• Hortonworks, July 2016: Hive 2.1 with LLAP; claims 25x faster for TPC-DS than Hive 1.2; 15-query subset; 1TB
• Radiant Advisors *1, June 2016: Impala 2.5 on CDH; 62 queries successful, 37 fail; 100GB / 1TB
• Radiant Advisors *1, June 2016: Presto .141t on Teradata Hadoop Appliance (HDP, CDH); 78 queries successful, 21 fail; 100GB / 1TB
• Radiant Advisors *1, June 2016: Hive 1.2.1, Tez 0.7.0 on HDP; 63 queries successful, 35 fail; 100GB / 1TB
No other vendor has demonstrated the ability to execute all 99 TPC-DS queries, even at lower scale factors.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Breaching the 100TB Mark with SQL Over Hadoop

• 4. Do you have any of these challenges?
Want to modernize your EDW without long and costly migration efforts.
Offloading historical data from Oracle, Db2 or Netezza because they are reaching capacity.
Need to operationalize machine learning.
Need to query, optimize and integrate multiple data sources from one single endpoint.
Slow query performance for SQL workloads.
Lack the skill set to migrate data from RDBMS to Hadoop / Hive.
• 5. Here’s How Big SQL Addresses These Challenges
Compatible with Oracle, Db2 & Netezza SQL syntax – modernizing EDW workloads on Hadoop has never been easier; application portability (e.g. Cognos, Tableau, MicroStrategy, …).
Federates all your data behind a single SQL engine – query Hive, Spark and HBase data from a single endpoint; federate your Hadoop data using connectors to Teradata, Oracle, Db2 & more; query data sources that have Spark connectors; addresses the skill-set gap needed to migrate technologies.
Delivers high performance & concurrency for BI workloads – unlock Hadoop data with the analytics tools of your choice.
Provides greater security while accessing data – robust role-based access control and Ranger integration.
Operationalizes machine learning through integration with Spark – bi-directional integration with Spark exploits Spark’s connectors as well as its ML capabilities.
• 6. IBM’s Big SQL Preserves the Open Source Foundation
Leverages the Hive metastore and storage formats. No lock-in: data is part of Hadoop, not Big SQL.
Applications sit on top of two co-existing SQL execution engines – Big SQL (IBM) and Hive (open source) – which share the Hive storage model (CSV, tab-delimited, Parquet, ORC, others…) and the Hive metastore (open source).
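To make this concrete, here is a minimal, hypothetical Big SQL DDL sketch (table and column names are invented, not from the deck): a table created through Big SQL is registered in the shared Hive metastore and stored in an open format, so Hive or Spark can read the same files.

    -- Create a Hadoop table in open Parquet format; the definition lands
    -- in the Hive metastore, so other engines can query the same data.
    CREATE HADOOP TABLE sales_demo (
      sale_id   BIGINT,
      amount    DECIMAL(10,2),
      sale_date DATE
    )
    STORED AS PARQUET;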
• 7. Data Virtualization
Big SQL queries heterogeneous systems in a single query – the only SQL-on-Hadoop engine that virtualizes more than 10 different data sources: RDBMS, NoSQL, HDFS or Object Store.
Big SQL Fluid Query (federation) reaches Oracle, Teradata, Db2, Netezza (PDA), Informix, Microsoft SQL Server, Hive, HBase, HDFS, Object Store (S3) and WebHDFS.
Big SQL allows query federation by virtualizing data sources and processing where the data resides, on the Hortonworks Data Platform (HDP).
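As an illustration of what federation looks like in practice, here is a hedged sketch using the Db2-heritage federation DDL that underpins Fluid Query; the server, schema and table names are invented, and exact options vary by data source.

    -- Register a remote Oracle source and expose one of its tables locally
    CREATE SERVER ora_dw TYPE ORACLE VERSION '12' WRAPPER NET8
      OPTIONS (NODE 'ora_tns_alias');
    CREATE USER MAPPING FOR USER SERVER ora_dw
      OPTIONS (REMOTE_AUTHID 'dwuser', REMOTE_PASSWORD 'secret');
    CREATE NICKNAME ora_orders FOR ora_dw."DW"."ORDERS";

    -- Join remote Oracle data with local Hadoop data in a single query;
    -- eligible predicates are pushed down to the source for processing.
    SELECT o.order_id, s.amount
    FROM   ora_orders o
    JOIN   sales_demo s ON s.sale_id = o.order_id;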
• 8. Data Offloading and Analytics
Easy porting of enterprise applications; works seamlessly with Business Intelligence tools like Cognos to gain insights.
Big SQL offers SQL syntax tolerance (ANSI SQL compliant) for Oracle SQL, Db2 SQL and Netezza SQL, feeding Cognos Analytics and other tools.
Big SQL integrates with the Information Governance Catalog by enabling easy shared imports to InfoSphere Metadata Asset Manager, which allows you to: analyze assets; utilize assets in jobs; designate stewards for the assets.
Big SQL is a SQL engine that combines SQL compatibility, portability and integration points to deliver composite analysis on data.
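A hypothetical example of that syntax tolerance (query, table and column names invented for illustration): Oracle-flavored functions such as NVL and DECODE are accepted by Big SQL's compatibility layer, so applications can often run with little or no rewriting.

    -- Oracle-style scalar functions running unchanged on Big SQL
    SELECT customer_id,
           NVL(segment, 'UNKNOWN') AS segment,
           DECODE(status, 'A', 'Active',
                          'I', 'Inactive',
                               'Other') AS status_desc
    FROM   customers
    FETCH FIRST 10 ROWS ONLY;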
• 9. Data Security
Role-based access control enables separation of duties / audit: BRANCH_A, BRANCH_B and FINANCE (security admin) roles each see only what they are entitled to, down to row level.
Big SQL offers row-level and column-level access control on top of role-based access control (RBAC), among other security settings.
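The speaker notes mention CREATE PERMISSION and CREATE MASK; here is a hedged sketch of how those Db2-heritage statements could express the branch scenario (table, role and column names are invented):

    -- Each user sees only rows for the branch role they hold,
    -- while the security admin role sees everything
    CREATE PERMISSION branch_rows ON employees
      FOR ROWS WHERE (VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A') = 1
                      AND branch = 'A')
                  OR VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
      ENFORCED FOR ALL ACCESS
      ENABLE;

    -- Only the FINANCE role sees the salary column unmasked
    CREATE MASK salary_mask ON employees FOR COLUMN salary RETURN
      CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
           THEN salary ELSE NULL END
      ENABLE;

    ALTER TABLE employees
      ACTIVATE ROW ACCESS CONTROL
      ACTIVATE COLUMN ACCESS CONTROL;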
• 10. Themes
Integration: bringing together the different components of the Hadoop ecosystem and making sure Big SQL offers enhanced capabilities and a smooth experience.
Performance: execution of queries, simple or complex, needs to complete with low latency. Big SQL continues to focus on improving query execution for all open source file formats.
Usability & Serviceability: by simplifying the complexity of setup and troubleshooting that comes with the Hadoop ecosystem, users benefit from increased productivity in their use cases.
Enterprise, Governance & Security: enterprise needs center on application portability and data security. Big SQL has high application portability and continues to enhance it, while also focusing on centralized governance and auditing.
• 11. Integration
Big SQL v5.0.x focuses on providing integration with the following:
+ Bi-directional Spark integration allows you to run Spark jobs from Big SQL
+ Ranger integration provides centralized security
+ YARN integration for easy flex up/down of Big SQL workers
+ Integration with HDP v2.6.x
• 12. Performance
Constant performance upgrades help Big SQL lead in performance for complex queries. At a high level:
+ Elastic Boost technology: launch multiple Big SQL workers on a node for faster SQL execution
+ ORC enhancements: improved handling of the ORC file format has shown a marked increase in performance (on par with the Parquet file format)
+ Performance benchmarks: a new benchmark shows great performance
• 13. Enterprise, Governance & Security
At a high level:
+ Oracle compatibility for application portability
+ Ranger integration provides centralized security
• 14. Big SQL Sandbox
A single-node sandbox to visualize data using Zeppelin. At a high level:
+ UI-driven, simple install with a few clicks
+ Data loaded for immediate use
+ Guided tutorials using Zeppelin
• 15. Right Tool for the Right Job
Not mutually exclusive: Hive, Big SQL & Spark SQL can co-exist and complement each other in a cluster.
Big SQL – federation, complex queries, high concurrency, enterprise ready, application portability, all open source file formats: ideal tool for BI data analysts and production workloads.
Spark SQL – machine learning, data exploration, simpler SQL: ideal tool for data scientists and discovery.
Hive – in-memory cache, geospatial analytics, ACID capabilities, fast ingest: ideal tool for BI data analysts and production workloads.
• 16. To Summarize – Core Themes of Big SQL
Big SQL is the only SQL-on-Hadoop engine that…
SQL Compatibility: understands different SQL dialects, so you can reuse skills and applications with little or no change – applications work as-is.
Federation: connects to remote data sources with query pushdown; federates to more than 10 data sources: RDBMS, NoSQL and/or Object Stores; Spark connectors add more data sources & ML models.
Performance: executes all 99 TPC-DS queries; exhibits high performance even when data scales up to 100TB with complex SQL; handles many concurrent users without relinquishing performance and scales linearly with increased concurrency.
Enterprise & Security: automatic memory management; role- and column-based data security using SQL; integrates with Ranger for centralized management.
Integrates bi-directionally with Spark, like no other, and operationalizes ML models.
• 17. Breaching the 100TB mark: The Environment
Hadoop-DS @ 100TB on the F1 cluster; phases: load, single-stream run, 4-stream run.
• 18. About the Hadoop-DS Workload
Aim: to provide the fairest and most meaningful comparison of SQL-over-Hadoop solutions.
The Hadoop-DS benchmark is based on the TPC-DS* benchmark and strives to follow the latest (v2.3) TPC-DS specification whenever possible. Key deviations:
- No data maintenance or data persistence phases – not possible across all vendors
- Uses a subset of queries that all solutions can successfully execute at that scale factor
- Queries are not cherry-picked
It is STILL the most complete TPC-DS-like benchmark executed so far. It includes the database build, a single-stream run and a multi-stream run. First published in Oct 2014, using Big SQL, Impala and Hive. This publication compares Big SQL v5.0 with Spark 2.1 and focuses on the 4-stream run.
• 19. What is TPC-DS?
TPC = Transaction Processing Performance Council: a non-profit corporation (vendor independent) that defines various industry-driven database benchmarks. DS = Decision Support.
Models a multi-domain data warehouse environment for a hypothetical retailer: retail sales, web sales, inventory, demographics, promotions.
Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB.
99 pre-defined queries in four classes: 1. Reporting, 2. Iterative OLAP, 3. Data Mining, 4. Ad Hoc.
TPC-DS is now at Version 2.5 (http://www.tpc.org/tpcds/default.asp)
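For flavor, here is a hypothetical reporting-style query over the TPC-DS retail schema (illustrative only, not one of the 99 official queries): it aggregates store revenue by category for one year, the kind of star-join aggregation the Reporting class exercises.

    -- Reporting-style star join: fact table plus two dimensions
    SELECT d.d_year,
           i.i_category,
           SUM(ss.ss_net_paid) AS revenue
    FROM   store_sales ss
    JOIN   date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN   item i     ON ss.ss_item_sk      = i.i_item_sk
    WHERE  d.d_year = 2002
    GROUP  BY d.d_year, i.i_category
    ORDER  BY revenue DESC;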
• 20. So What Does It Take To Breach 100TB?
F1 cluster: designed for Spark.
Hardware: 28 Lenovo x3650 M5 nodes; 100 GbE Mellanox SN2700 switch; 448 TB SSD PCIe Intel NVMe; 1,008 Intel Broadwell cores; 42,000 GB RAM.
Software: IOP 4.2; RHEL 7.2; Spark 2.1; Big SQL 5.0.
Cluster bandwidth: 375 GB/s network; 480 GB/s disk I/O – all 4 seasons of House of Cards + Orange Is the New Black loaded into RAM in 1 sec.
Data prep: 7 hours to generate 100TB raw data; 39 hours to partition and load (Parquet); 39.7 TB on-HDFS size for the Parquet files.
Compression: 60% space saved with Parquet.
Energy use per node: 167 watts at stand-by; 560 watts at peak load; 475 watts load average.
Peak CPU usage: 96% total; 73.7% user; 22% sys; <1% iowait.
• 21. Hadoop-DS @ 100TB: Tuning the Stack
1. Low-level machine tuning: file system optimization & mounts, network tuning, disable CPU scaling
2. HDFS tuning: 5 properties, mainly for the NameNode
3. MR2 tuning (for data load): 7 properties – map/reduce memory, Java heap size, io.sort.factor
4. Parquet <-> HDFS block alignment to reduce unnecessary I/O ops: block size = 128MB
5. HDFS rebalance: balance data across nodes to reduce uneven load
6. YARN tuning: 10 properties, mainly for container allocation
7. Spark SQL tuning: 11 properties plus 3 spark-submit properties
8. Big SQL tuning (with Elastic Boost): 11 properties (5 are now used as the defaults)
Workflow: basic CPU, I/O & network throughput tests; generate data; load data; re-load data; rebalance; Spark SQL queries; Big SQL queries. Full details are in the backup slides.
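Item 4 above aligns the Parquet row-group size with the HDFS block size. A hedged, Hive-style session sketch of what that could look like (the property values are the 128 MB = 134,217,728 bytes from the slide; the two-column table is invented):

    -- Align Parquet row groups with HDFS blocks so a row group never
    -- straddles two blocks (which would force extra I/O operations).
    SET dfs.blocksize=134217728;
    SET parquet.block.size=134217728;

    CREATE TABLE store_sales_demo (
      ss_item_sk  BIGINT,
      ss_net_paid DECIMAL(7,2)
    )
    STORED AS PARQUET;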
• 22. 100TB Hadoop-DS is BIG data
• 23. 100TB Database Build
Parquet (with compression) was chosen as the storage format for both Big SQL and Spark SQL.
Fact tables were partitioned (compliantly) to take advantage of the new ‘partition elimination through join keys’ available in Big SQL v5. Both Big SQL and Spark SQL used exactly the same partitioned Parquet files. Spark SQL did not require the Analyze & Stats View build stages.
The Load stage took ~39 hours:
- The STORE_SALES table is heavily skewed on the null partition (SS_SOLD_DATE_SK=NULL)
- Most of the time was spent loading this null partition (~570GB; other partitions are ~20GB). In LOAD this is done by a single reducer
- INSERT..SELECT.. using multiple logical Big SQL workers may be faster (we ran out of time before we could try it)
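A minimal sketch of this kind of build in Big SQL (the column subset and staging-table name are invented; the partitioning key follows the *_SOLD_DATE_SK scheme noted in the speaker notes):

    -- Partitioned Parquet fact table; partition elimination through join
    -- keys can then prune partitions at query time.
    CREATE HADOOP TABLE store_sales_p (
      ss_item_sk  BIGINT,
      ss_net_paid DECIMAL(7,2)
    )
    PARTITIONED BY (ss_sold_date_sk BIGINT)
    STORED AS PARQUET;

    -- Populate from a staging text table; the partition column comes last
    INSERT INTO store_sales_p
    SELECT ss_item_sk, ss_net_paid, ss_sold_date_sk
    FROM   store_sales_text;

    -- The "Analyze" stage: gather statistics for the cost-based optimizer
    ANALYZE TABLE store_sales_p COMPUTE STATISTICS FOR ALL COLUMNS;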
• 24. Query compliance through the scale factors
Spark SQL has made impressive strides since v1.6 to run all 99 TPC-DS compliant queries out of the box – but only at the lower scale factors. At 100TB, 16 of the 99 queries fail with runtime errors or time out (> 10 hours).
Big SQL has been successfully executing all 99 queries since Oct 2014. IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB, with compliant DDL and query syntax.
For an apples-to-apples comparison, the 83 queries which Spark could successfully execute were used.
• 25. Performance: Single-Stream Run
A single-stream run represents a power run – an interesting engineering exercise, but not representative of real-life usage.
Big SQL is 3.8x faster than Spark 2.1 for the single-stream run.
Total elapsed time for the Hadoop-DS workload @ 100TB, single stream (shorter is better): Big SQL v5.0, 99 queries: 27,859 secs; Big SQL v5.0, 83 queries: 18,145 secs; Spark SQL v2.1, 83 queries: 68,735 secs.
• 26. Performance: 4 Concurrent Streams
Multi-stream query execution most closely represents real-life usage; the analysis focuses on 4-stream runs.
Big SQL is 3.2x faster than Spark 2.1 for 4 concurrent streams.
Total elapsed time for the Hadoop-DS workload @ 100TB, 4 concurrent streams (shorter is better): Big SQL v5.0, 99 queries x 4 streams = 396 queries: 81,329 secs; Big SQL v5.0, 83 queries x 4 streams = 332 queries: 49,217 secs; Spark SQL v2.1, 83 queries x 4 streams = 332 queries: 155,515 secs.
• 27. CPU Profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 Concurrent Streams)
Average CPU consumption across the 4-stream run: 76.4% for Big SQL vs. 88.2% for Spark SQL.
Spark SQL uses almost 3x more system CPU – these are wasted CPU cycles. There is very little I/O wait time for either engine (SSDs are fast).
Since the Spark SQL run is 3.2x longer than Big SQL's, Spark SQL actually consumes more than 3x the CPU resources to complete the same workload.
Big SQL: some nodes have higher CPU consumption than others, showing imbalance in the distribution of work amongst the nodes. Spark SQL: even distribution of CPU across nodes, indicating work is more evenly distributed.
Overall, Big SQL is much more efficient in how it uses the available CPU resources.
• 28. Big SQL vs Spark SQL Memory Consumption (Hadoop-DS @ 100TB, 4 Concurrent Streams)
Big SQL is only ‘actively’ using approximately one third of the available memory, indicating more memory could be assigned to bufferpools, sort space, etc. – so Big SQL could be even faster and/or support greater concurrency!
Spark SQL does a better job of utilizing the available memory, but consequently has less room for improvement via tuning.
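As a hedged illustration of how one might confirm that bufferpools have headroom, the Db2-heritage monitoring function MON_GET_BUFFERPOOL is available on the Big SQL head (a Db2-based engine); the hit-ratio expression below is illustrative, not from the deck:

    -- Logical vs. physical reads per bufferpool; a low hit ratio with
    -- idle RAM suggests bufferpools could be grown further.
    SELECT bp_name,
           pool_data_l_reads,
           pool_data_p_reads,
           CASE WHEN pool_data_l_reads > 0
                THEN DEC(1.0 - (DEC(pool_data_p_reads, 19, 4)
                                / pool_data_l_reads), 5, 4)
           END AS data_hit_ratio
    FROM   TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t;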
• 29. I/O Activity: 4 Streams
Spark SQL performs more I/O than Big SQL. Since the Spark SQL run lasts 3.2x longer, Spark SQL is actually reading ~12x more data than Big SQL and writing ~30x more data – indicating greater efficiency within the mature Big SQL optimizer & execution engine.
Spark SQL needs to do more I/O to complete the workload, but when high I/O throughput is required, Big SQL can drive the SSDs harder than Spark SQL.
• 30. Hadoop-DS @ 100TB: Summary
Performance: Big SQL 3.2x faster with 4 concurrent query streams.
Working queries: Big SQL ran all 99; Spark SQL completed 83.
CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU 76.4%.
I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput per node: read 4.4 GB/sec, write 2.8 GB/sec.
Compression: 60% space saved with Parquet.
• 31. Lessons Learned: General!
Building a brand-new cluster from the ground up is tough! Full-stack tuning is required to get the cluster to a state capable of handling 100TB.
Pay close attention to how the data is loaded, and think carefully about the partitioning scheme. Be cognizant of data skew – especially in your partitioning scheme (a skew check like the sketch below can catch this early).
Concurrency is much more difficult than single stream. Complex queries pose a significant problem for most SQL-over-Hadoop solutions at scale.
Near-best performance is often achieved in the first 5-8 runs; the absolute best may take much longer – the 80:20 rule.
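A minimal sketch of such a skew check against the staging data (the staging-table name is invented; the null-heavy SS_SOLD_DATE_SK partition from the build slide is exactly what this would surface):

    -- Rank partition keys by row count; a huge NULL bucket signals skew
    SELECT ss_sold_date_sk,
           COUNT(*) AS rows_in_partition
    FROM   store_sales_text
    GROUP  BY ss_sold_date_sk
    ORDER  BY rows_in_partition DESC
    FETCH FIRST 10 ROWS ONLY;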
• 32. Lessons Learned: Spark SQL!
Spark SQL has come a long way, very quickly. BUT… success at lower scale factors does not guarantee success at higher scale factors, and significant effort is required to tune failing queries.
Spark SQL still relies heavily (almost entirely) on manual tuning. To get the best out of Spark SQL, the level of parallelism (num-executors) and the memory assigned to them (executor-memory) need to be tuned for each query, and the optimal values vary depending on how many other Spark queries are running in the cluster at that particular time. This is very difficult, if not impossible, to manage in a production environment.
• 33. Lessons Learned: Big SQL!
4 query failures in early runs using product defaults: quickly eliminated via product tuning, and Big SQL defaults were changed as a result.
Focused on hardening the “Elastic Boost” capability to gain maximum throughput, with extensive development work in the Scheduler to evenly distribute work amongst the logical workers.
Spare capacity (memory, CPU) could have been better utilized – we could have done better!
Big SQL has unique tuning features to help with stubborn queries. Only a limited set of these is allowed by the Hadoop-DS rules, but they could be deployed in production clusters.
• 34. ORC Performance Evaluation
Big SQL v5.0.1 vs Hive 2.1 (LLAP on Tez), Hadoop-DS @ 10TB; phases: load, single stream, 6 streams.
Hardware: 17 Lenovo x3650 M4 nodes; 640 logical cores; 2,048 GB RAM; 288 TB disk space; 10 Gb Ethernet.
• 35. Hadoop-DS @ 10TB: Big SQL v5.0.1 and Hive 2.1 (LLAP with Tez) at a Glance
Performance, 6 streams: Big SQL 2.3x faster. Performance, 1 stream: Big SQL 1.8x faster.
Workload: scale factor 10 TB; file format ORC (zlib); concurrency 6 streams; query subset: the 85 common compliant queries both engines could run.
Resource utilization, 6 streams: Big SQL used 1.5x fewer CPU cycles.
Stack: HDP 2.6.1; Big SQL 5.0.1; Hive 2.1 LLAP on Tez.
Interesting facts: fastest query 5.4x faster (Big SQL: 1.5 sec, Hive: 8.1 sec); slowest query (query 67) 1.7x faster (Big SQL: 6,827 sec, Hive: 11,830 sec); Big SQL faster for 80% of the queries run.
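For reference, a minimal hedged sketch of an ORC table in Big SQL of the kind used in this evaluation (columns invented; ORC's default compression codec is zlib, matching the "ORC (zlib)" format noted above):

    -- ORC storage; Big SQL v5 ORC enhancements bring it on par with Parquet
    CREATE HADOOP TABLE store_sales_orc (
      ss_item_sk  BIGINT,
      ss_net_paid DECIMAL(7,2)
    )
    STORED AS ORC;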
• 36. Why???
Advanced autonomics: Self-Tuning Memory Manager, integrated Workload Manager.
World-class cost-based optimizer: query rewrite, advanced statistics, advanced partitioning.
Native row & columnar stores, hardened runtime, Elastic Boost, SQL compatibility.
• 37. So, what does all this boil down to?
Data scientists and business analysts can be 3-4 times more productive using Big SQL compared to Spark SQL. With Big SQL, users can focus on what they want to do, and not worry about how it is executed.
Proof points: able to successfully run all 99 TPC-DS queries @ 100TB in 4 concurrent streams; performance leadership; uses fewer cluster resources; simpler configuration with mature self-tuning and workload management features.
Big SQL is the best SQL-over-Hadoop engine for complex analytical workloads – no one else has published @ 100TB (or anywhere close).
• 38. Questions? https://developer.ibm.com/hadoop/category/bigsql/
• 39. Thank you!
• 40. Backup slides
• 41. SQL over Hadoop use cases
Ad hoc data preparation for analytics; federation; transactional with fast lookups (fewer users); ad hoc queries and discovery; ELT and simple, large-scale queries; complex SQL, many users, deep analytics; operational data store.
Candidate engines per use case include Big SQL, Hive, Spark SQL, HBase and Phoenix. Need to balance the “best tool for the job” paradigm with maintainability and support costs. But SQL can do so much more…
• 42. Big SQL v5 + YARN Integration: Dynamic Allocation / Release of Resources
The Big SQL Slider package implements the Slider Client APIs. The Slider client negotiates resource allocation with the YARN Resource Manager & Scheduler, and Big SQL workers run in YARN containers on the NodeManagers, alongside the Big SQL head and the Big SQL AM, with data in HDFS. Stopped workers release memory to YARN for other jobs.
• 43. Big SQL v5 Elastic Boost – Multiple Workers per Host: More Granular Elasticity
Instead of one (potentially very large memory) worker per host, Elastic Boost runs multiple smaller Big SQL workers per host, each in its own YARN container on a NodeManager. Workers can be started and stopped fully independently of each other, giving much finer-grained scaling of Big SQL capacity up and down.
• 44. Cluster Details (F1 Spark cluster): Designed for Spark
Totals across all cluster data nodes: 1,080 cores, 2,160 threads; 45TB memory; 100TB database storage replicated 3x plus temp, 480TB raw, 240 NVMe.
Hardware details: the 100TB SQL database requires a 28-node cluster – 2 management nodes (Lenovo x3650 M5) co-located with data nodes, 28 data nodes (Lenovo x3650 M5), 2 racks with 20x 2U servers per rack (42U racks), and 1 switch (100GbE, 32 ports, 1U, Mellanox SN2700).
Each data node: CPU: 2x E5-2697 v4 @ 2.3GHz (Broadwell, 18 cores); Passmark: 23,054 / 46,108. Memory: 1.5TB per server (24x 64GB DDR4 2400MHz). Flash storage: 8x 2TB SSD PCIe NVMe (Intel DC P3700), 16TB per server. Network: 100GbE adapter (Mellanox ConnectX-5 EN).
I/O bandwidth per server: 16GB/s; network bandwidth: 12.5GB/s. I/O bandwidth per cluster: 480GB/s; network bandwidth: 375GB/s.
• 45. Query compliance through the scale factors (cont.)
Almost half (7) of the 16 Spark SQL 2.1 queries which fail at 100TB can be classified as complex in nature. No surprise, since Spark is a relatively immature technology – and in line with findings from the original Hadoop-DS work in 2014.
Big SQL's RDBMS heritage is the key to providing enterprise-grade SQL for complex analytical workloads. Any query which does not complete, or requires modification or tuning, impacts business productivity and wastes valuable human & machine resources.
• 46. Hadoop Tuning
HDFS settings (default → 100TB):
- NameNode Java heap: 4G → 20G
- NameNode new generation size: 512 → 2.5G
- NameNode maximum new generation size: 512 → 2.5G
- Hadoop maximum Java heap size: 4G → 10G
- DataNode max data transfer threads (helps HDFS data rebalance): 4096 → 16384
MR2 settings, applicable to load operations (default → 100TB):
- MapReduce Framework map memory: 2G → 35G
- MapReduce Framework reduce memory: 4G → 69G
- MapReduce sort allocation memory (helps with HDFS rebalancing): 1G → 2G
- MR Map Java heap size (MB): 1638 → 28262
- MR Reduce Java heap size (MB): 7840 → 56524
- mapreduce.jobhistory.intermediate-done-dir: /var → /data15/var
- mapreduce.task.io.sort.factor: 100 → 1000
YARN settings (default → 100TB):
- Percentage of physical CPU allocated for all containers: 80% → 90%
- Number of virtual cores: 57 (80%) → 72
- Minimum container size: 44G → 20G
- ResourceManager Java heap size: 1G → 8G
- NodeManager Java heap size: 1G → 8G
- AppTimelineServer Java heap size: 1G → 8G
- YARN Java heap size: 1G → 2G
- Advanced fault tolerance, yarn.resourcemanager.connect.retry-interval.ms: 30000 → 250
- Advanced yarn-site, hadoop.registry.rm.enabled: false → true
- Advanced yarn-site, yarn.client.nodemanager-connect.retry-interval-ms: 10000 → 250
• 47. Spark Tuning
Spark settings (default → 10TB → 100TB):
- spark.rpc.askTimeout (s): 120 → 1200 → 36000
- spark.kryoserializer.buffer.max (mb): 64 → 768 → 768
- spark.yarn.executor.memoryOverhead (mb): 384 → 1384 → 8192
- spark.driver.maxResultSize: 1G → 8G → 40G
- spark.local.dir: /tmp → /data[1-10]/tmp → /data[1-10]/tmp
- spark.network.timeout: 120 → 1200 → 36000
- spark.sql.broadcastTimeout: 120 → 1600 → 36000
- spark.buffer.pageSize: computed → computed → 64m
- spark.shuffle.file.buffer: computed → computed → 512k
- spark.memory.fraction: 0.6 → 0.8 → 0.8
- spark.scheduler.listenerbus.eventqueue.size: 10K → 120K → 600K
• 48. Big SQL Tuning
Big SQL settings (default → 100TB):
- Big SQL workers per node: 1 → 12
- INSTANCE_MEMORY: 25% → 97%
- DB2_CPU_BINDING: 25% → MACHINE_SHARE=94
- DB2_EXT_TABLE_SETTINGS: DFSIO_MEM_RESERVE:20 → DFSIO_MEM_RESERVE:0
- DFT_DEGREE: 8 → 4
- SORTHEAP / SHEAPTHRES_SHR: computed → 4.4 G / 70 G
- Bufferpool size: computed → 15 G
- scheduler.cache.splits: false → true
- scheduler.assignment.algorithm: GREEDY → MLN_RANDOMIZED
- scheduler.dataLocationCount: computed → max:28
- scheduler.maxWorkerThreads: computed → 8192
(In the original slide, green highlighting marked the settings whose defaults were changed in v5.0.)
• 49. What About Other SQL-on-Hadoop TPC-DS Benchmarks?
- Cloudera, Sept 2016: Impala 2.6 on AWS – claims 42% more performant than AWS Redshift; 70-query subset; 3TB
- Cloudera, August 2016: Impala 2.6 – claims 22% faster for TPC-DS than the previous version; 17 queries referenced; data volume not specified
- Cloudera, April 2016: Impala 2.5 – claims 4.3x faster for TPC-DS than the previous version; 24-query subset; 15TB *2
- Hortonworks, July 2016: Hive 2.1 with LLAP – claims 25x faster for TPC-DS than Hive 1.2; 15-query subset; 1TB
- Radiant Advisors *1, June 2016: Impala 2.5 on CDH – 62 successful, 37 fail; 100GB / 1TB
- Radiant Advisors *1, June 2016: Presto .141t on Teradata Hadoop Appliance (HDP, CDH) – 78 successful, 21 fail; 100GB / 1TB
- Radiant Advisors *1, June 2016: Hive 1.2.1, Tez 0.7.0 on HDP – 63 successful, 35 fail; 100GB / 1TB
No other vendor has demonstrated the ability to execute all 99 TPC-DS queries – even at lower scale factors.

Editor's Notes

  1. *TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC)
  2. Some of these challenges might sound familiar: Modernizing your existing EDW without long and costly migration efforts. Querying across multiple types of data management systems. Lacking the ability to query across traditional data warehouse and cloud-based data systems. Lack of skill set to migrate data from RDBMS to Hadoop/Hive. Optimizing and integrating external data sources with existing data sources. Offloading and porting workloads from Oracle, Db2 & Netezza – it is difficult to offload workloads from EDW platforms. Performance: the system slows down once too many users access it, and interactive query performance is unacceptable. Operational efficiency around older data warehouse environments requires tools to help with ease of use and automation to manage workloads and schedule jobs.
  3. 4 things to remember: Compatible with Oracle, Db2, Netezza – modernizing EDW workloads on Hadoop, app portability. Federates data behind a single SQL engine, using connectors to Teradata, Oracle, Db2, Netezza. Addresses the skills gap needed to migrate technologies. People hate migrations and rewriting code – it interrupts business processes and takes other resources to do this work, and there is a skills gap to do it; they want an engine that is 100% compatible. With Big SQL you can go into your accounts with Netezza, Oracle, DB2 – they can move part of their RDBMS data warehouse to Hadoop without having to change any code. Ask your customer – do you want to optimize your Oracle, Netezza, Db2 workloads by moving them to Hive? Sick of Oracle, Netezza? We have a solution to help you get them out. Finally – it delivers high performance for complex SQL workloads.
  4. Big SQL is just an alternate execution engine that uses the same Hive storage model and integrates with the Hive metastore. In fact, Big SQL won’t work without the Hive metastore. Instead of MapReduce, Big SQL uses a native C/C++ MPP engine. Applications can choose to connect to Hive, or Big SQL – they both co-exist on the Hadoop platform. In the following slides, we’ll cover the benefits of using Big SQL’s execution engine over MapReduce.
  5. Federation consists of the following: a federated server, a federated database, data sources, and clients (users and applications) that access both the local database and databases from data sources. Federation is known for its strengths: Transparent – correlate data from local tables and remote data sources, as if all the data were stored locally in the federated database; Extensible – update data in relational data sources, as if the data were stored in the federated database; Autonomous – move data to and from relational data sources without interruptions; High function – take advantage of the data sources' processing strengths by sending requests to the data sources for processing; High performance – compensate for SQL limitations at the data source by processing parts of a distributed request at the federated server. Data sources supported for Big SQL (for the latest information, visit https://www-304.ibm.com/support/entdocview.wss?uid=swg27044495): DB2 for z/OS 8.x, 9.x, 10.x; DB2 for LUW 9.7, 9.8, 10.1, 10.5; Oracle 11g, 11gR1, 11gR2, 12c; Teradata 12, 13, 14, 15 (not supported on POWER systems); Netezza 4.6, 5.0, 6.0, 7.2 (not supported on POWER systems); Informix 11.5; Microsoft SQL Server 2012, 2014; ODBC 3.0 or later. Big SQL now comes pre-bundled with DataDirect drivers from Progress, which eliminates the need for downloading drivers and enables easy setup. Spark connectors enhance Big SQL's connectivity to other data sources and also operationalize machine learning models.
  6. Big SQL is ANSI compliant, therefore it can run Oracle SQL, DB2 SQL and Netezza SQL. Big SQL tables can be created for data residing in HDFS, Hive, HBase, Object Store and WebHDFS. Big SQL is aware of remote indexes and table statistics, which it uses to optimize federated queries. Big SQL provides: a unified view of all your tables, with federated query support to external databases; data stored optimally as Hive or HBase tables, or read with Spark, optimized for the expected workload; data secured under a single security model (including row/column security across all); the ability to join across all datasets using standard ANSI SQL, with Oracle, Netezza and DB2 extensions if you prefer; a single database connection and driver. No other SQL engine for Hadoop even comes close.
  7. For Albert, from Branch A, only rows from his branch are listed because he has access to records with the same branch name as his. Bonnie, from Branch B, can only view records from her branch and does not have access to view salary. But Cindy, a security administrator, can view records from all branches and also view all columns. When customers take the time to review the security capabilities of other SQL engines on Hadoop, they realize a lot of things they are used to are missing. Big SQL supports all of the important features. Separation of duties – an important feature if you really want to operationalize the environment. Most enterprise customers don't like having a database "super user" who can manage the environment and also see all data (like root access for the SQL engine). Big SQL can define database administrator roles that allow administration of the environment without permitting access to the data, for example. Big SQL also provides row-level and column-level security using row permissions (CREATE PERMISSION) or column masking (CREATE MASK) on the tables. Note that both Big SQL ROLES (same concept as DB2 or Oracle roles) and OS groups (local and LDAP) are supported.
  8. *TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC)
  9. Four broad types of queries in the set of 99. Reporting queries, Ad-hoc queries, Iterative OLAP queries, Data mining queries. Minimum of 4 concurrent users running all 99 queries to publish a valid result for an official TPC-DS benchmark. Results presented here are not an official TPC-DS benchmark result. But running TPC-DS queries has become the de-facto benchmark for SQL over Hadoop engines.
  10. Data prep: Generate 100TB of raw text (CSV) using the tool provided in the TPC-DS toolkit, then load the text into Parquet tables (compression on, with Snappy).
  11. Summary of cluster tuning…
  12. Tables partitioned on: _SALES tables partitioned on *_SOLD_DATE_SK; _RETURNS tables partitioned on *_RETURN_DATE_SK.
  13. Spark SQL made impressive strides in version 2.0 to be able to run all 99 TPC-DS queries out of the box (with allowed minor query modifications) at 1GB and 1TB. However, these tests show that as the volume of data grows beyond 1TB, Spark 2.1 struggles to complete all 99 queries. Spark SQL was able to complete the following 83 queries at 100TB: 1,2,3,4,5,7,8,9,10,11,12,13,17,18,19,21,22,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,65,66,67,68,69,70,71,72,73,76,78,79,80,81,82,83,85,86,87,88,89,90,91,92,93,96,97,99
  14. For an apples to apples comparison, the 83 queries which Spark could successfully complete were executed in a single stream run on both Big SQL and Spark SQL: 1,2,3,4,5,7,8,9,10,11,12,13,17,18,19,21,22,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,65,66,67,68,69,70,71,72,73,76,78,79,80,81,82,83,85,86,87,88,89,90,91,92,93,96,97,99
  15. For an apples to apples comparison, the 83 queries which Spark could successfully complete were executed in 4 concurrent streams on both Big SQL and Spark SQL: 1,2,3,4,5,7,8,9,10,11,12,13,17,18,19,21,22,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,65,66,67,68,69,70,71,72,73,76,78,79,80,81,82,83,85,86,87,88,89,90,91,92,93,96,97,99
  16. First – let's look at efficiency. Big SQL actually used less average CPU compared to Spark SQL. User CPU time is time spent on the processor running your program's code (or code in libraries); system CPU time is the time spent running code in the operating system kernel on behalf of your program. Another way to look at system CPU time is that it's the time the processor worked on the operating system's functions connected to that specific program (e.g. forking a process / process management). In general, system CPU is like wasted CPU cycles – so you can see that Big SQL is roughly 3x more efficient in this regard.
  17. Charts shows average memory consumption across each node in the cluster whilst the 4-stream workload was running.
  18. Charts show average and max i/o throughput per node during the 4-stream test.
  19. Results of the comparison summarized in a single slide.
  20. There is still no substitute for product maturity, and the strength of Big SQL comes from its heritage. The core of Big SQL is built around IBM’s Relational Database technology which has been running in production systems for 25 plus years. As such, Big SQL has a hardened, mature, and very efficient SQL optimizer and runtime – this is the key to its world class scalability and performance. It’s this lineage which raises Big SQL above other vendors in the SQL over Hadoop space.
  21. Main points: This slide summarizes the results of the Big SQL and Hive evaluation. Big SQL 5.0.1 completed the Hadoop-DS 6-stream concurrency test 2.3 times faster than Hive 2.1, both engines using 85 queries. Big SQL can run all 99 queries in the Hadoop-DS/TPC-DS workload, but for this exercise we compared Big SQL and Hive using a subset with 85 queries. 14 queries were dropped from the query set because we were unable to come up with compliant and working versions of these queries for Hive. By “compliant”, we mean any changes required to get the query to run adhere to the “minor query modifications” allowed by the TPC-DS specification. By “working”, we mean the query completed successfully. Scale factor was 10 TB. The data was loaded into the ORC file format with Zlib compression.. We installed HDP 2.6.1 which included Hive 2.1. We installed Big SQL 5.0.1 on top of that. We ran Hive 2.1 with LLAP on TEZ. For both Big SQL and Hive, the slowest query was query 67. With 6 streams, the workload consisted of 85 x 6 = 510 total queries. For 80% of the queries (out of 510 total queries), Big SQL was faster. Big SQL used 1.5 times fewer CPU cycles to complete the workload. Big SQL consumed a larger percentage of available CPU during the runs, but since Big SQL finished sooner, it actually completed the workload using less total CPU cycles. This evaluation was done on a 17-Node cluster (Lenovo x3650 M4’s) with 1 management node and 16 compute nodes. The numbers for cores, memory, and disk space shown here are the totals across the 16 compute nodes. The network was 10 gigabit ethernet. Both the Hive and Big SQL tables were partitioned.
  22. There is still no substitute for product maturity, and the strength of Big SQL comes from its heritage. The core of Big SQL is built around IBM’s Relational Database technology which has been running in production systems for 25 plus years. As such, Big SQL has a hardened, mature, and very efficient SQL optimizer and runtime – this is the key to its world class scalability and performance. It’s this lineage which raises Big SQL above other vendors in the SQL over Hadoop space.
  23. The proof points will be demonstrated during the course of this presentation.
  24. The great thing about Hadoop is that there are so many great open source engines available. Rather than trying to re-invent storage options, Big SQL exploits what is there…. HBase is another engine that can provide rapid, scalable, lookups with update and delete support.
  25. NM = YARN Node Manager This is a high level architectural view of how Big SQL integrates with YARN through Slider. The Slider Client is a general purpose client for applications like Big SQL to interoperate with YARN. The client defines a set of APIs that are to be implemented by the application (Big SQL). So the red cube/package represents a slider package included with Big SQL that implements the slider APIs. When we start up Big SQL, the slider client negotiates resource allocation with YARN and the Big SQL workers are now started in containers. [click] If we want to free up some resources back to YARN (not much demand for Big SQL at the moment/time of day), we use slider to stop a subset (or all) Big SQL workers. The memory formerly allocated to Big SQL workers are now available to other workloads in Hadoop.
  26. Instead of running one (potentially very large memory) worker per host, elastic boost enables the ability to run multiple smaller Big SQL workers per host. We can start and stop workers fully independent of each other. Therefore, while the same amount of maximum memory/CPU is consumed when all workers are on…. we get much more granularity of resource usage in terms of scaling up and down capacity with more workers.
  27. Cluster was designed for Spark SQL; hence it has lots of memory, fast SSDs and a high bandwidth network.
  28. Almost half of the failed Spark SQL queries can be classified as complex – those which do deep analytics. Indicating Spark SQL struggles most with the complex queries at larger scale factors.
  29. At 100TB some queries returned huge amounts of data (millions of rows, with no fetch-only clause), hence the 40GB result-size limit. Spread the local dirs onto the data disks, comma-delimited. Lots of tasks in lots of stages create a huge event queue, so its capacity must be increased. 0.8 for spark.memory.fraction: the fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached-data eviction occur; raising it from 0.6 to 0.8 helps reduce spills.
  30. The chart shows the settings tuned for Big SQL. The default values for the properties highlighted in green will be changed in v4.3 – meaning these 5 values will not require further tuning. INSTANCE_MEMORY and DB2_CPU_BINDING are configured to allow Big SQL to use 97% of memory on each node (the other 3% being used for the OS and other Hadoop components, such as the HDFS NameNode), and 94% of CPUs on each node; 94% was used as it allowed 2 cores to be reserved for the OS and the HDFS NameNode. DFT_DEGREE was dialled down because parallelism is achieved by defining multiple Big SQL workers per node – so the need for SMP parallelism is reduced. Actual settings for sort space are in 4K pages: SORTHEAP 1146880; SHEAPTHRES_SHR 18350080. And for bufferpools in 32K pages: BP 491520. The MLN_RANDOMIZED scheduler allocation algorithm was specifically designed for environments with multiple Big SQL workers per node.
  31. Table shows TPC-DS based benchmarks published by SQL over Hadoop vendors during 2016. https://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/ http://blog.cloudera.com/blog/2016/08/bi-and-sql-analytics-with-apache-impala-incubating-in-cdh-5-8-3x-faster-on-secure-clusters/ https://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/ http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/ http://radiantadvisors.com/whitepapers/sqlonhadoopbenchmark2016/ https://issuu.com/radiantadvisors/docs/radiantadvisors_sql-on-hadoop_bench *1 Radiant Advisors benchmark sponsored by Teradata *2 Only one of the 24 queries tested references a fact table other than store_sales. So can this really be claimed as a 15TB benchmark if the vast majority of queries reference only approx. ½ of the data set. HDB/HAWQ are not in the chart because they published no new benchmarks in 2016. Vertica claim to run 98 of the queries in a presentation from last August at HPE Big Data conference. Data volume, release levels, and configuration were not specified and so it is not included in the table.