
Actian Vector on Hadoop: First Industrial-strength DBMS to Truly Leverage Hadoop

  • internationalization
  • internationalization
  • Execution
      Subset of TPC-DS as chosen by Impala
      Data size is 3TB (SF3000)
      Executed on 5-node “rushcluster” in Austin
      Both Impala and Vector numbers are on the same hardware
    Comparison with Impala
      Verified that Impala plans are sensible
      Currently observed average speedup is 11x
      Optimal query plans (manually written) give us a 16x speedup
      These are real numbers! We executed manual plans directly
      Changes in the cost model would get us to this performance
    Performance improvements
      Cost model changes will get us to 16x speedup
      Pipeline of query execution changes
        Well into H2
        Estimated to get us 2x improvement
      So, estimated speedup vs Impala would be ~30x (no guarantees)

    Planning to run TPC-H SF1000 and SF3000
    With all planned improvements (end of the year) we should be able to beat the EXASOL cluster numbers.
  • Actian Vector on Hadoop: First Industrial-strength DBMS to Truly Leverage Hadoop

    1. Highest Performing SQL-in-Hadoop: Announcing “Project Vortex”
       Peter Boncz, Database Systems Researcher & Actian Chief Technical Advisor, MonetDB architect & Vectorwise founder
       Hadoop Summit, San Jose, June 3 2014
       Confidential © 2014 Actian Corporation
    2. Agenda
       ■ History of Vectorwise and Vector Processing
       ■ Survey of SQL-on-Hadoop Approaches
       ■ Vortex (Actian Vector) Architecture
       ■ Benchmark Results
       ■ Roadmap
    3. Vector Database Processing Timeline
       ■ 1994, MonetDB: column-store pioneer. Start of new wave of analytic DBMS.
       ■ 2004, X100 engine: vector execution model. High performance DBMS using CPU cache optimizations.
    4. Typical RDBMS: tuple-at-a-time iterator
       Query: SELECT name, salary*.19 AS tax FROM employee WHERE age < 25
       Plan: SCAN → SELECT → PROJECT, each operator pulling one tuple per next() call
       Diagram: SCAN emits (john, 40, 30000) and (carl, 20, 10000); SELECT passes only (carl, 20, 10000); PROJECT computes the tax 1900 for carl
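The tuple-at-a-time model on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Vector's or any product's code: the class names `Scan`, `Select`, `Project` and the helpers `build_plan` and `run` are hypothetical, and the two rows are the values visible in the slide diagram.

```python
# Rows are (name, age, salary); values taken from the slide diagram.
EMPLOYEE = [("john", 40, 30000), ("carl", 20, 10000)]


class Scan:
    """Leaf operator: hands out one stored tuple per next() call."""

    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # None signals end of stream


class Select:
    """Pulls tuples from its child until one satisfies the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None


class Project:
    """Applies an expression to each tuple pulled from its child."""

    def __init__(self, child, expr):
        self.child = child
        self.expr = expr

    def next(self):
        row = self.child.next()
        return None if row is None else self.expr(row)


def build_plan():
    # SELECT name, salary*.19 AS tax FROM employee WHERE age < 25
    scan = Scan(EMPLOYEE)
    select = Select(scan, lambda r: r[1] < 25)
    return Project(select, lambda r: (r[0], r[2] * 0.19))


def run(plan):
    """Drain the plan: one next() call per operator per tuple."""
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    return out
```

Every result tuple costs at least one `next()` call per operator plus one call per expression, which is the interpretation overhead the following slides set out to reduce.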
    5. “Vectorized Query Execution”
       ■ Vector contains data of multiple tuples (~100)
       ■ All primitives are “vectorized”
       ■ Effect: far fewer Iterator.next() and primitive calls
    6. Why Vectors Are Better
       Column slices to represent in-flow data
       NOT: vertical is a better table storage layout than horizontal (though we still think it often is)
       RATIONALE:
       - Simple array operations are well-supported by compilers
         No record layout complexities
       - SIMD friendly layout
       - Assumed cache-resident
    7. Vectorized “Primitives” (basic methods)
         int map_mul_flt_col_flt_val(int *res, int *col, int val, int n)
         {
             /* multiply every value of a column vector by a constant */
             for (int i = 0; i < n; i++)
                 res[i] = col[i] * val;
             return n;  /* number of values produced */
         }
       ■ Many primitives take just 1-6 cycles per tuple
       ■ High IPC. Get SIMD out of the box. No instruction or data cache misses.
       ■ 10-100x faster than tuple-at-a-time
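To make the contrast with tuple-at-a-time processing concrete, here is a rough Python sketch of the same query run vector-at-a-time. It rests on assumptions: the helper names are hypothetical, the vector size of 100 is simply the slide's "~100", and the selection vector (a list of qualifying positions) is a common technique used here for illustration, not something the slides specify. Plain Python lists stand in for the tight C loops of a real primitive.

```python
VECTOR_SIZE = 100  # assumed; the slides only say "~100" tuples per vector


def select_lt_col_val(col, val):
    """Primitive: positions in the vector where col[i] < val."""
    return [i for i, v in enumerate(col) if v < val]


def map_mul_col_val(col, val, sel):
    """Primitive: multiply the selected values of a column by a constant."""
    return [col[i] * val for i in sel]


def fetch_col(col, sel):
    """Primitive: gather the selected values of a column."""
    return [col[i] for i in sel]


def vectors(columns, size):
    """Cut every column into aligned slices of at most `size` values."""
    n = len(next(iter(columns.values()), []))
    for start in range(0, n, size):
        yield {name: col[start:start + size] for name, col in columns.items()}


def run_query(columns, size=VECTOR_SIZE):
    """SELECT name, salary*.19 AS tax FROM employee WHERE age < 25."""
    names, taxes, primitive_calls = [], [], 0
    for vec in vectors(columns, size):
        sel = select_lt_col_val(vec["age"], 25)
        taxes.extend(map_mul_col_val(vec["salary"], 0.19, sel))
        names.extend(fetch_col(vec["name"], sel))
        primitive_calls += 3  # three calls per vector, not per tuple
    return names, taxes, primitive_calls
```

For a 250-row table this makes 9 primitive calls (3 vectors of up to 100 rows, 3 primitives each) instead of several calls per row, so the per-call overhead is amortized over the whole vector.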
    8. Vector Database Processing Timeline
       ■ 1994, MonetDB: column-store pioneer. Start of new wave of analytic DBMS.
       ■ 2004, X100 engine: vector execution model. High performance DBMS using CPU cache optimizations.
       ■ 2010, Vectorwise: Actian launches 1st commercial vector processing DBMS. Vectorwise blows the top off TPC benchmarks.
    9. PhD thesis of Spyros Blanas (2013)
    10. The relational industry is trying to adopt vector processing...
    11. Actian Vector: Often copied, never surpassed.
    12. Vector Database Processing Timeline
        ■ 1994, MonetDB: column-store pioneer. Start of new wave of analytic DBMS.
        ■ 2004, X100 engine: vector execution model. High performance DBMS using CPU cache optimizations.
        ■ 2010, Vectorwise: Actian launches 1st commercial vector processing DBMS. Vectorwise blows the top off TPC benchmarks.
        ■ 2012, SQL on Hadoop: introduction of SQL access to Hadoop data. Immature, not optimized, not enterprise-ready.
    13. Hive gets it too!
    14. Vector Database Processing Timeline
        ■ 1994, MonetDB: column-store pioneer. Start of new wave of analytic DBMS.
        ■ 2004, X100 engine: vector execution model. High performance DBMS using CPU cache optimizations.
        ■ 2010, Vectorwise: Actian launches 1st commercial vector processing DBMS. Vectorwise blows the top off TPC benchmarks.
        ■ 2012, SQL on Hadoop: introduction of SQL access to Hadoop data. Immature, not optimized, not enterprise-ready.
        ■ 2014, Actian introduces “Project Vortex”: Vectorwise built natively into Hadoop. Highest performing, SQL compliant DBMS running inside Hadoop.
    15. SQL on Hadoop
        Big Data processing pipelines on Hadoop
        ■ Unstructured → Structured
        ■ Unstructured: Data Mining, Pattern Matching (MapReduce)
        ■ Structured: Cleaner data, bulk loads into warehouse
        ■ Do we have to buy/manage two clusters?
          1. Hadoop/MapReduce
          2. MPP SQL warehouse
        The case for SQL on Hadoop:
        ■ Reduced hardware cost (1 cluster)
        ■ Agile: no more copying data between Hadoop and SQL
        ■ Broaden access to Hadoop data through a wealth of SQL apps
        ■ Standardize cluster admin skills on Hadoop (human resources)
    16. Vendor Approaches to “SQL on Hadoop”
        “outside Hadoop”: SQL Outside Hadoop
        ■ MPP DB → need 2 clusters
        ■ Connector approach (data copy)
        “wrapped legacy”: Mature but Limited/Slow
        ■ Slow legacy query engine (e.g. PostgreSQL)
        ■ Limited HDFS integration (no deletes, updates)
        “from scratch”: Integrated but Immature
        ■ Immature/poor optimizers+engines
        ■ Incomplete SQL support, no delete/updates, I18N, security, workload mgmt, access control?
    17. “SQL on Hadoop” Vendor Landscape
        [Chart: vendor approaches plotted by SQL Maturity (performance+features) and Hadoop Integration; tick labels Low, Native, High. Regions: “outside Hadoop”, “wrapped legacy”, “from scratch”. Callout: Most Mature & Integrated SQL]
    18. “Project Vortex”: Actian Vector in Hadoop
        First industry-strength analytical RDBMS “made for Hadoop”
        Key Features
        ■ Compressed vector data formats work natively on HDFS: HDFS (append-only) and compressed columnar storage are friends
        ■ The most efficient query engine on the market: vectorized, leading single-server TPC-H for years
        ■ Easily configurable and maintainable MPP system: relies solely on Hadoop for system administration
        ■ Very high bulk-load performance: partitioned table support and fully parallel loading
        ■ Full SQL functionality: incl. access control, analytic/window functions, complete SQL APIs
        ■ Mature query optimizer: enhanced with advanced distributed parallel execution for scale-up/out
    19. “Project Vortex”: Actian Vector in Hadoop
        Hadoop Features in Development:
        ■ Automatic HDFS block placement: leveraging replication, always HDFS shortcut reads, also after nodes fail
        ■ Direct querying on Hadoop data formats: Text, Parquet, ORCfile
        ■ Support for full fine-grained trickle updates (insert/delete/modify): thanks to patented delta update structure (Positional Delta Trees)
        ■ YARN integration: co-existence of MapReduce and DBMS, avoiding stragglers
        ■ Elastic resource management: workload-driven scaling up & down in 40 steps from 2.5% to 100%
    20. Project Vortex: Architecture
        Single SQL frontend connect point
        ■ Does not store any data
        ■ Can be outside Hadoop cluster
        ■ Can be an existing Vector installation
        Many “worker” data nodes (X100 backend) on Hadoop cluster
        ■ This collection of compute nodes is called the “worker set”
        ■ MPI communications, all-to-all
        Worker Set
        ■ Subset of Hadoop cluster, can be shrunk/enlarged without data copy
        ■ Compute nodes in worker set should have roughly equal resources
        ■ Any can coordinate query execution (session master)
    21. Vortex Architecture
        [Diagram: the SQL frontend sends the query plan to the session master; X100 backend processes run on the data nodes of the Vortex “worker-set” and exchange data via all-to-all MPI communications; YARN and the name node are shown next to the worker set. Labels: Actian Director for Management, SQL]
    22. Project Vortex: Storage
        Data Format
        ■ Vector native compressed data formats with fast decompression
        ■ MinMax indexes stored separately (allow skipping data blocks without reading them)
        ■ HDFS block placement: we decide where the replicas are
        ■ Tables are either hash-partitioned or global (i.e. non-partitioned)
        Global File System
        ■ All I/O is through HDFS
        ■ Achieved in an append-only file system!
        ■ Any worker can read any table partition
        ■ Responsibility for handling partitions is decided at session start
        ■ Optimization algorithm assigns partitions to nodes that have the file local
        ■ 100% HDFS “shortcut reads”, also when the node that wrote the partition is down
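The MinMax index mentioned on this slide can be illustrated with a small Python sketch. It only shows the general technique (per-block minimum and maximum used to skip blocks for a range predicate), under assumptions: the names `BlockStats`, `build_minmax`, `blocks_for_range` and `scan_range` are hypothetical, and an in-memory list stands in for a column stored in HDFS blocks.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    block_id: int
    min_val: int
    max_val: int


def build_minmax(values, block_size):
    """Summarize each block of a column by its smallest and largest value."""
    stats = []
    for block_id, start in enumerate(range(0, len(values), block_size)):
        chunk = values[start:start + block_size]
        stats.append(BlockStats(block_id, min(chunk), max(chunk)))
    return stats


def blocks_for_range(stats, lo, hi):
    """Blocks that may hold a value in [lo, hi]; all others are skipped."""
    return [s.block_id for s in stats if s.max_val >= lo and s.min_val <= hi]


def scan_range(values, stats, block_size, lo, hi):
    """Read only the candidate blocks and filter them exactly."""
    out = []
    for block_id in blocks_for_range(stats, lo, hi):
        start = block_id * block_size
        out.extend(v for v in values[start:start + block_size] if lo <= v <= hi)
    return out
```

The index is small compared to the data, which is consistent with storing it separately: the candidate block list can be computed before any data block is read.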
    23. Vortex Architecture
        [Diagram: the architecture of slide 21 with storage added. Partitions p1 to p6 of a partitioned table are replicated across the data nodes; a global table (g) and the write ahead log (WAL) are stored alongside them. X100 backend processes use HDFS “shortcut reads” and give HDFS block placement hints. Other labels: session master, SQL frontend, query plan, YARN, name node, all-to-all MPI data communications, Actian Director for Management, SQL]
    24. Project Vortex: Minimizing Network Traffic
        Storage
        ■ Co-located partitions (local partitioned hash-joins)
        ■ Replicated tables (local shared-HashTable hash-joins)
        ■ Co-partitioned clustered indexes (local merge-joins)
        ■ MinMax indexes for predicate pushdown (correlates over merge-joins)
        Parallel Cost Model
        ■ Distributed joins, distributed query optimizer considers:
          - Both key-partitioned and shared (broadcast) HashJoin
          - Local broadcast HashJoin for replicated tables
        ■ Distributed GroupBy, distributed query optimizer considers:
          - Both key-partitioned and global re-aggregated GroupBy
          - Local early aggregation followed by partitioned aggregation
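The last GroupBy strategy, local early aggregation followed by partitioned aggregation, can be sketched as follows. This is a simplified single-process simulation under assumptions: the function names are hypothetical, the aggregate is a SUM, and a toy byte-sum hash replaces a real partitioning function.

```python
from collections import defaultdict


def local_aggregate(rows):
    """Early aggregation: collapse one worker's rows to one subtotal per key."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def partition_of(key, num_workers):
    """Toy deterministic hash partitioning; a stand-in for the real function."""
    return sum(key.encode()) % num_workers


def distributed_sum(worker_rows):
    """SUM(value) GROUP BY key over rows spread across several workers.

    Returns the final totals and the number of partial rows routed
    through the exchange step (some may stay on their own worker).
    """
    num_workers = len(worker_rows)
    inbox = [defaultdict(int) for _ in range(num_workers)]
    routed = 0
    for rows in worker_rows:
        for key, subtotal in local_aggregate(rows).items():
            inbox[partition_of(key, num_workers)][key] += subtotal
            routed += 1
    totals = {}
    for part in inbox:
        totals.update(part)  # each key lives in exactly one partition
    return totals, routed
```

With two workers holding three rows each over the keys "a" and "b", only four partial rows reach the exchange instead of six raw rows. The saving grows with the number of rows per key, which is presumably why a cost model would weigh this plan against plain partitioned aggregation.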
    25. Project Vortex: Resource Management
        YARN integration
        ■ Ask YARN which nodes are less busy, when enlarging the worker set
        ■ Inform YARN of our usage (CPU, memory) to prevent overload
        ■ Placeholder processes to decrease and increase YARN resources
        Workload management
        ■ Workload monitoring to gradually determine Hadoop footprint
        ■ Choose (# cores, RAM) for each query, given the current footprint
        ■ Choose to involve all or just the minimal subset of workers
        Elasticity
        ■ Scale down to minimal subset of nodes, one core each
        ■ Scale up to all nodes, all cores
    26. Vortex Architecture
        [Diagram: the architecture of slide 23 with resource management added. The worker set is shown with a minimal YARN footprint and a maximal YARN footprint, and Hadoop & Vortex resource info is exchanged with YARN. Other labels as on slide 23: partitions p1 to p6, global table (g), write ahead log (WAL), HDFS “shortcut reads”, HDFS block placement hints, session master, SQL frontend, query plan, name node, all-to-all MPI data communications, Actian Director for Management, SQL]
    27. Project Vortex: Data Ingestion
        Bulk-load
        ■ Fast Parallel Loader, executes in parallel on all worker nodes
        ■ SQL COMBINE statement to add and remove data in bulk
        ■ Text and Parquet readers including nested records (under development)
        Updates (DML)
        ■ Support for Insert, Modify, Delete, Upsert
        ■ Modify, Delete, Upsert use Positional Delta Trees (PDTs)
        ■ Changes get sent to the master, which emits the Write Ahead Log (WAL)
        ■ At startup, workers only load PDTs for their partitions from WAL
        ■ Partitioned Tables partition DML to all nodes in worker set
        ■ Replicated Tables execute DML on the session master
        ■ Session master broadcasts all PDT changes to all worker nodes
    28. Positional Delta Trees (PDTs)
        INSERT INTO inventory VALUES('Berlin', 'table', 'Y', 10)
        INSERT INTO inventory VALUES('Berlin', 'cloth', 'Y', 20)
        INSERT INTO inventory VALUES('Berlin', 'chair', 'Y', 5)
        TABLE0 (stable table):
          SID  STORE   PROD   NEW  QTY  RID
          0    London  chair  N    30   0
          1    London  stool  N    10   1
          2    London  table  N    20   2
          3    Paris   rug    N    1    3
          4    Paris   stool  N    5    4
        PDT leaf entries (SID, type, value):
          0  ins  (Berlin, chair, Y, 5)
          0  ins  (Berlin, cloth, Y, 20)
          0  ins  (Berlin, table, Y, 10)
        PDT internal node (SID, ∆): 0, 2, 1
        “Positional Update Handling in Column Stores”, SIGMOD 2010
        PDTs enable fine-grained updates on append-only data (HDFS)
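The idea behind the slide's example can be sketched in Python, with heavy simplification. This is not the PDT data structure from the SIGMOD 2010 paper: a flat sorted list replaces the tree, only inserts are handled, and the names `InsertDeltas` and `merge_scan` are hypothetical. What it does show is the positional principle: each insert records the stable ID (SID) of the row it must precede, the stable table is never rewritten, and a scan merges both to produce the current image, whose positions are the RIDs.

```python
import bisect

# Stable table from the slide, sorted on (STORE, PROD); list index = SID.
STABLE = [
    ("London", "chair", "N", 30),
    ("London", "stool", "N", 10),
    ("London", "table", "N", 20),
    ("Paris", "rug", "N", 1),
    ("Paris", "stool", "N", 5),
]


class InsertDeltas:
    """Flat, sorted stand-in for a PDT that holds inserts only."""

    def __init__(self, stable):
        self._keys = [row[:2] for row in stable]  # sort keys of stable rows
        self.entries = []  # (sid, row), ordered by SID and then by row

    def insert(self, row):
        # SID of the first stable row that sorts after the new row.
        sid = bisect.bisect_left(self._keys, row[:2])
        bisect.insort(self.entries, (sid, row))


def merge_scan(stable, deltas):
    """Merge stable rows and pending inserts; output index = RID."""
    out, i = [], 0
    entries = deltas.entries
    for sid, row in enumerate(stable):
        while i < len(entries) and entries[i][0] == sid:
            out.append(entries[i][1])
            i += 1
        out.append(row)
    out.extend(row for _, row in entries[i:])  # inserts after the last row
    return out
```

Applying the slide's three Berlin inserts yields three entries at SID 0, and the merged scan returns the Berlin rows (chair, cloth, table) ahead of the five untouched stable rows. Leaving the stable data untouched is what makes the approach compatible with an append-only file system.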
    29. Vortex vs Impala: how much faster?
        Background to the “Impala Subset” of the TPC-DS benchmark can be found here: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
        Both executed on the same hardware and software environment: 5 nodes: 16 cores, 32 threads, 2.4GHz, 64GB RAM, 2x1TB drives, 2x10Gb Ethernet.
        Non-Disclosure – Under Embargo Until Public Launch Date: June 3, 2014
        [Chart: speedup per query for q3, q7, q19, q27, q34, q42, q43, q46, q52, q53, q55, q59, q63, q65, q68, q73, q79, q89, q98; axis ticks from 5x to 25x. Avg: 14x faster]
    30. Vortex vs. other “native” Products
        Young systems (Hive, Impala, Presto)
        ■ Significantly lower performance
        ■ Incomplete SQL (window functions, correlated subqueries, views)
        ■ No trickle updates (or just bulk load), not always ACID
        ■ Immature query optimizer, authentication, access control, I18N, workload management, APIs, validated SQL apps
        Vortex
        ■ Ultimate SQL on Hadoop Performance
          - The fastest analytical query engine in town comes to Hadoop
          - Lots of parallel query optimization (min. network bandwidth usage)
        ■ Superior Hadoop Integration
          - Optimized HDFS block placement
          - YARN integration, elasticity
    31. “Project Vortex” Timeline
        Actian Vector in Hadoop - Preview Edition Available
        ■ Send request to info@actian.com
        End of June: initial release
        ■ Good performance on medium-sized clusters
        ■ Core Actian DataFlow integration
        Fall 2014: second release
        ■ Trickle update functionality
        ■ Performance and scalability optimizations
        ■ HDFS block placement
        ■ YARN dynamic resource management
    32. Learn More…
        Visit the Actian booth #P6 in the expo area!
        ■ Get a copy of the Project Vortex Technical White Paper
        ■ See a live product demo of Vortex vs Impala
        ■ Meet the Actian “Vortex” developers
        Win a signed technical book!
        ■ signing @16:00
        Get a Big Data T-shirt!
    33. Acknowledgements
        homepages.cwi.nl/~boncz/msc/2012-AndreiCosteaAdrianIonescu.pdf
        Adrian Ionescu, Andrei Costea (plus the extended Actian Vector team)
    34. Thank You
        www.actian.com
        facebook.com/actiancorp
        @actiancorp
