Confidential © 2014 Actian Corporation
SQL + Hadoop: The High Performance Advantage
Turn Hadoop into a High Performance Analytics Platform
Published in: Data & Analytics
  • But it isn’t easy

    Changing your company is not easy.

    Give examples: you’ve just invested $1m in a data warehouse, but the business now wants to … It will now cost you tenfold.
  • We are announcing Vector on Hadoop: industrial-strength SQL on Hadoop with atom-smashing speed never before seen in the industry. This is a core part of our Actian Analytics Platform – Hadoop SQL Edition. Let me tell you about it (details below) and show you a few things.

    What are we announcing?
    Highest performing, most industrialized SQL in Hadoop
    Turns Hadoop into a High-Performance, Fully-Functional Analytics Database
    Actian Analytics Platform – Hadoop SQL Edition includes our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics work flow, all running natively in Hadoop via YARN
    How is this unique?
    Highest performing, most industrialized SQL access to Hadoop data
    Only end-to-end analytic processing natively in Hadoop (covers the full analytics processes: data blending & enrichment, discovery & data science, analytics & operational BI)
    Most consumable, accessible, manageable Hadoop analytics

    What does this mean to our customers?
    Removes all barriers for business access to big data analytics
    Unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
    Accelerates time to value and turns Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
  • I’m going to show you three things: How fast it is, how easy it is to get started and how it can be used in real-world scenarios.
  • internationalization
  • 1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This way, we can use SIMD instructions and, because of the lack of branching, make sure the CPU pipelines are not thrashed.

    A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.

    2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.

    3: To feed this engine with enough data, we’re also applying the vectorized paradigm to the storage subsystem. First of all, we’re using a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512mb and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved per block, but typically more than one block of the same column is grouped.

    To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.

    4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution.
    We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.

    5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.

    All in all, the execution engine is able to do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.
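Points 1 and 5 of these notes can be illustrated together. The following is a minimal pure-Python sketch, not Actian's implementation: the names (`Block`, `make_blocks`, `range_sum`) and the tiny block size are assumptions made for illustration, and Python lists only mimic the control flow, not the SIMD or cache behaviour. It shows a column split into blocks with one min-max entry each, a scan that skips blocks whose value range cannot match, and primitives that each apply one operation to a vector of up to 1024 values.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

VECTOR_SIZE = 1024  # rows per vector, the typical size quoted in the notes


@dataclass
class Block:
    """Hypothetical storage block: values of one column plus its min-max entry."""
    values: List[int]
    lo: int
    hi: int


def make_blocks(column: List[int], rows_per_block: int) -> List[Block]:
    """Split a column into blocks and record each block's minimum and maximum."""
    blocks = []
    for start in range(0, len(column), rows_per_block):
        chunk = column[start:start + rows_per_block]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def vectors(block: Block) -> Iterator[List[int]]:
    """Yield the block's values one vector at a time."""
    for start in range(0, len(block.values), VECTOR_SIZE):
        yield block.values[start:start + VECTOR_SIZE]


def select_between(vec: List[int], lo: int, hi: int) -> List[int]:
    """Primitive: apply a single range filter to a whole vector."""
    return [v for v in vec if lo <= v <= hi]


def sum_vector(vec: List[int]) -> int:
    """Primitive: apply a single aggregation to a whole vector."""
    return sum(vec)


def range_sum(blocks: List[Block], lo: int, hi: int) -> Tuple[int, int]:
    """Sum the values in [lo, hi]; also report how many blocks were read."""
    total = 0
    blocks_read = 0
    for block in blocks:
        # Min-max check: skip blocks that cannot contain a matching value.
        if block.hi < lo or block.lo > hi:
            continue
        blocks_read += 1
        for vec in vectors(block):
            total += sum_vector(select_between(vec, lo, hi))
    return total, blocks_read


# Sorted (non-random) data, so the min-max entries are selective.
column = list(range(10_000))
blocks = make_blocks(column, rows_per_block=4096)
total, blocks_read = range_sum(blocks, 5000, 5999)
print(f"sum={total}, blocks read={blocks_read} of {len(blocks)}")
```

On randomly ordered data every block's range would overlap the predicate and nothing would be skipped, which is the caveat point 5 makes about data that is "not completely random".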
  • Execution
    Subset of TPC-DS as chosen by Impala
    Data size is 3TB (SF3000)
    Executed on 5-node “rushcluster” in Austin
    Both Impala and Vector numbers are on the same hardware

    Comparison with Impala
    Verified that Impala plans are sensible
    Currently observed average speedup is 11x
    Optimal query plans (manually written) give us a 16x speedup
    These are real numbers! We executed the manual plans directly
    Changes in the cost model would get us to this performance

    Performance improvements
    Cost model changes will get us to the 16x speedup
    Pipeline of query execution changes, well into H2, estimated to get us a 2x improvement
    So, estimated speedup vs Impala would be ~30x (no guarantees)

    Planning to run TPC-H SF1000 and SF3000
    With all planned improvements (end of the year) we should be able to beat the EXASOL cluster numbers.
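The ~30x projection in these notes is a product of the stated factors. A minimal sketch of that arithmetic, using only the numbers from the notes (the variable names are illustrative, and the result is an estimate, not a measurement):

```python
# Speedup factors relative to Impala, taken from the notes above.
observed_avg_speedup = 11.0  # measured with the current optimizer
manual_plan_speedup = 16.0   # measured with manually written optimal plans
pipeline_gain = 2.0          # estimated gain from planned execution changes

# Gain that cost-model changes alone would need to deliver.
cost_model_gain = manual_plan_speedup / observed_avg_speedup

# Projected speedup once both improvements are in place.
projected_speedup = manual_plan_speedup * pipeline_gain

print(f"cost-model gain needed: {cost_model_gain:.2f}x")
print(f"projected speedup vs Impala: {projected_speedup:.0f}x")
```

The product is 32x, which the notes round down to about 30x.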
  • What are we announcing?
    Actian Analytics Platform – Hadoop SQL Edition, the first offering that turns Hadoop into a fully-functioning analytics platform.
    This new edition introduces the highest performing, most industrialized SQL in Hadoop, powered by our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics work flow, all running natively in Hadoop via YARN.

    How is this unique?
    Provides the only end-to-end analytic processing natively in Hadoop (covers the full analytics processes: data blending & enrichment, discovery & data science, analytics & operational BI)
    Delivers the highest performing, most industrialized SQL access to Hadoop data
    Makes the entire analytic process more consumable, easier to access, and easier to manage than on any other

    What does this mean to our customers?
    Industrialized SQL in Hadoop removes all barriers for business access to big data analytics
    Broad SQL access unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
    Turbocharged Hadoop analytics and SQL in Hadoop accelerate time to value and turn Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
  • We want to partner with you to identify the most obvious places where big data analytics could be applied in your organization.
  • SQL + Hadoop: The High Performance Advantage

    1. SQL + Hadoop: The High Performance Advantage. Turn Hadoop into a High Performance Analytics Platform. Emma McGrattan, Actian; Jim Hare, Actian. 8 July 2014.
    2. Agenda: 1. Introduction; 2. Hadoop Challenges; 3. Actian Analytics Platform – Hadoop SQL Edition; 4. Industrialized, High Performance SQL in Hadoop; 5. Questions. All lines are muted. To ask a question, use Chat or Q&A panel. Recording will be made available. We’ll be running a few polling questions.
    3. Actian is Delivering Transformational Value. $140M Revenues + Profitable. 10,000+ Customers. Global Presence: 8 world-wide offices, 7x24 multinational support model.
        “Actian is now very powerfully positioned in the big data and analytics markets.” Robin Bloor
        “Actian has assembled all of the next generation IPs into a single analytics platform, allowing users a level of flexibility in data interaction that competitors have not been able to match.” siliconANGLE
    4. Big Data Offers Significant Opportunities …But only for those who embrace it. Personalized Experience; New Products/Services; Reduce Risk; Predictive Analytics; Many Data Sources; Low Cost Storage; Improve Decision-Making.
    5. Enter Hadoop as the Big Data Enabler for Low Cost Storage. DW Offload; Landing Zone; Data Reservoir; ?
    6. But It isn’t Easy with Hadoop. Batch performance; Time to Value; Expensive Skills; Silo’d Data Access; Data preparation.
    7. Hadoop Complexity Forcing Organizations to Move Data in order to Analyze it.
        Diagram labels: DW Offload; Landing Zone; Hadoop; Data Reservoir; Data Transfer; Data Management; Analytics Processing; Visualization & Data Science Workbench.
        Result: duplicate storage & infrastructure costs, more IT resources, network bandwidth usage, and complexity.
    8. CIOs Challenged by Big Data Costs. One in three CIOs pay between 21 and 30 cents per gigabyte a month. Translation: it costs a company $3.12 million per year to store 500,000 gigabytes at an average cost of 26 cents per gigabyte per month. Source: CIO Insight, http://www.cioinsight.com/it-strategy/storage/slideshows/cios-challenged-by-big-data-costs.html
    9. CIOs Challenged by Types of Big Data. 73% of CIOs say up to 50% of their data will be unstructured within two years. Source: CIO Insight, http://www.cioinsight.com/it-strategy/storage/slideshows/cios-challenged-by-big-data-costs.html
    10. Instead, what if you could move the analytic processing to the Hadoop data? … And transform Hadoop from a data lake into a high performance, fully functional analytics platform.
        Diagram labels: Data Science Workbench; Analytic Processing; Data Management; SQL User Access.
    11. Introducing the Actian Analytics Platform – Hadoop SQL Edition
        What is it? Patented X100 vector processing engine plus visual data and analytics work flow, all running natively in Hadoop via YARN. Turns Hadoop into a High-Performance, Fully-Functional Analytics Database.
        How is this unique? Highest performing, most industrialized SQL access to Hadoop data. Only end-to-end analytic processing natively in Hadoop. Most consumable, accessible, manageable Hadoop analytics.
        What does this mean to you? Removes all barriers for business access to big data analytics. Enables SQL users with no constraints on Hadoop data. Accelerates time to value.
    12. The Industry’s Abuzz – about Actian!
        “Deploying on Hadoop enables the Actian Analytics Platform to scale to massively parallel scale without having to modify the underlying engine. For Actian, Hadoop is a means to an end; it provides an opening for Actian to introduce a fast SQL engine that operates at scale.” Tony Baer, Principal Analyst, Software, Ovum
        “Actian’s platform now makes Hadoop data repositories accessible to the entire enterprise by empowering millions of business-savvy SQL users and business analysts to conduct advanced analytics directly on data in the Hadoop Distributed File System (HDFS). Companies investing in Hadoop now can broaden the scope of data discovery, increase the accuracy of decisions, and speed time to value.” Daniel Gutierrez, Inside Big Data
        “The latest version of the Actian Analytics Platform provides end-to-end analytic processing natively in Hadoop. This will make the Hadoop Big Data framework more accessible by offering high-performance ELT (extract, load and transform) and SQL analytics on Hadoop with no need for MapReduce skills. This is a big deal because data scientists with Hadoop skills are in short supply, while SQL skills are relatively abundant.”
    13. Actian Analytics Platform – Hadoop SQL Edition. Removes all barriers for business access to big data analytics.
        Diagram labels: Libraries of Analytics; Hadoop; Connections to Access Any Data; Visual Data and Analytic Workbench; High Performance Data Flow Engine; Industrialized SQL Analytics Database; Natively in Hadoop.
        Business Processes; Users; Machines; Applications.
        Expansive Connectivity. Data Blending & Enrichment; Discovery; Data Science; Analytics; Operational BI.
        Enterprise Data; Machine Data; Social Data; Data Warehouse; SaaS Data; Amazon Redshift.
    14. Actian Analytics Platform – Hadoop SQL Edition. Lightning fast and industrial strength SQL in Hadoop – Up to 30X faster than Impala. Full end-to-end analytic processing platform, all native in Hadoop. Packaged with “real world” solution blueprints.
    15. Visual Data Science & Analytics Workbench
        • Drag/drop interface with 100’s of data prep and analytic functions
        • Connect, blend, & enrich data and perform discovery & data science
        • Build and test predictive models
        • Running on top of a high performance data flow engine
        • All natively within Hadoop via YARN
    16. Unleash millions of business-savvy SQL users with no constraints on Hadoop data. Actian Analytics Platform™: Analyze; Act; Connect.
        Ubiquitous Skills: ■ 1 Million+ SQL Users ■ $ Lower cost ■ Easy to find, in most companies ■ Embedded in the business
        Specialty Skills: ■ 150K MapReduce Programmers ■ $$$ Expensive ■ 170K Shortage, hard to find ■ Separate from the business
    17. Accelerate time to value and turn Hadoop data into transformational value.
        Actian Analytics Platform = 25 Minutes: Log Reader; Filter Rows; Group; Load Vector; k-Means.
        Coding MapReduce = 4 Weeks: Avro Writer (MapReduce Code); k-Means (MapReduce Code); Log Reader, Filter Rows, Group, Load Vector (MapReduce Code for each).
    18. Vendor Approaches to “SQL on Hadoop”
        “marketing jobs”: SQL Outside Hadoop • Connector approach • MPP DB: need 2 clusters • Expensive, hard to manage
        “wrapped legacy”: Mature but non-Integrated • Legacy engine (e.g. Postgres) + top layer • Store data outside HDFS (local files) • Separate Failover Management (tools)
        “from scratch”: Integrated but Immature • No trickle updates • Immature/poor optimizers + engines • I18N, security, workload mgmt, access control?
    19. “SQL on Hadoop” Vendor Landscape. Chart of Maturity (SQL support, ACID, reliability, security, connectivity, performance) against Hadoop Integration; axis labels: Low, Native, High. Plotted: “marketing jobs”; “wrapped legacy”; “from scratch”; Mature & Integrated.
    20. Actian Vector Hadoop Edition. Actian Analytics Platform Hadoop SQL Edition. Actian Analytics Platform. Running natively in Hadoop via YARN.
        Diagram labels: NameNode; DataNode (x8); Standard SQL Interfaces; Prepare; Orchestrate; Connect.
        Connect to any data via Actian DataConnect. Manage dataflow across the entire analytic process. Prepare, enrich, and analyze any data with Actian DataFlow.
        6 POINTS OF INNOVATION: Vector Processing; On Chip Cache; Fast Real-time Updates; Smart Compression; Storage Indexes; Multi-Core Parallelism.
        NEXT GENERATION DATABASE TECHNOLOGY: Columnar; Compressed; Storage Indexes.
    21. Actian Vector – Unmatched Innovation
        Chart of Time/Cycles to Process against Data Processed; labels: DISK, RAM, CHIP; 10GB, 2-3GB, 40-400MB; 2-20, 150-250, Millions.
        Six numbered points (1-6): Vector Processing (Single Instruction Multiple Data); 2nd Gen Column Store (Limit I/O, Efficient real time updates); Smarter Compression (Maximize throughput, Vectorized decompression); Exploiting Chip Cache (Process data on chip, not in RAM); Multi-core Parallelism (Maximize system resource utilization…); Storage Indexes (Quickly identify candidate data blocks, Minimize IO).
    22. TPC-H 1TB – Faster, Less Hardware. Fastest TPC-H QphH@1TB Benchmark (non-clustered). Source: www.tpc.org
        QphH: Actian Vector 445,529; Actian Vector 436,788; SQL Server 219,888; Oracle 209,534; Oracle 201,487; SQL Server 173,962; Sybase IQ 164,747; Oracle 140,181; SQL Server 134,117.
        Dates as listed: June ‘12; May ‘11; Aug ‘11; June ‘11; Sept ‘11; Apr ‘11; Dec ‘10; Apr ‘10; Dec ‘11.
        Hardware Cost (excluding discounts) as listed: $57,146; $1,229,968; $460,869; $2,402,706; $753,392; $278,527; $85,621; $1,249,967; $258,880.
    23. Actian Analytics Platform – Hadoop SQL Edition. Transform Hadoop into a High Performance Analytics Platform.
        Diagram labels: HADOOP; YARN; HDFS; NameNode; DataNode with HDFS and X100 (repeated per node); Standard SQL Interfaces; Visual Data & Analytics Workflow; Read; Load; Actian Vector; Blend & Enrich; Data Science & Analytics; High Performance, Industrialized SQL Database; High Performance, Parallelized Data Flow Engine.
        HDFS: • Original file format • Standard block replication
        Vector: • Column-based blocks • Compressed • Partitioned
        Replicated Vector: • >=3 Replicated Copies of Vector Blocks • Leveraged to co-locate data with various join keys
    24. History of the TPC-DS Comparison
    25. TPC-DS Benchmark Components. Diagram labels: Operational Systems; Refresh Process; Ad-hoc Reporting Queries; User Queries; DSS Database; TPC-DS; Reports; Store; Web; Catalog; Inventory; Promotions; Set of Files; ETL.
    26. Actian Hadoop SQL Performance. “Impala Subset” of TPC-DS Queries at Scale Factor 3000 (3TB). 16x avg. speedup.
        Chart: Speedup vs Impala (axis: Speedup Factor; series: Impala, Actian) for queries Q3, Q7, Q19, Q27, Q34, Q42, Q43, Q46, Q52, Q53, Q55, Q59, Q63, Q65, Q68, Q73, Q79, Q89, Q98.
        Background to “Impala Subset” of TPC-DS benchmark can be found here: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
        Both Executed on the Same Hardware and Software Environment: 5 Node Cluster with 64GB of RAM per node and 12x2TB Hard Disks.
    27. Most Industrialized SQL in Hadoop
        Comprehensive – covers full analytic process: data blending & enrichment, discovery & data science, analytics & operational BI
        Accessible – standard ANSI SQL to support standard BI tools; plus key advanced analytics including cube, grouping sets and windowing functions
        Optimized – mature, proven planner and optimizer; optimal use of every node, CPU, memory, and cache
        Secure – native DBMS security including authentication, user and role-based security, data protection, and encryption
        Reliable – fully ACID-compliant with multi-version read consistency, plus system-wide failover protection
        Manageable – resources managed automatically in Hadoop via YARN
        Consumable – now usable by millions of users with every SQL tool and application on the planet
        Scalable – unlimited expansion to handle extreme #s of users, nodes, data
    28. Actian Director for Management
    29. Actian Analytics Platform – Hadoop SQL Edition: Industrialized, High-Performance SQL in Hadoop. Only end-to-end analytic processing natively in Hadoop. Highest performing, most industrialized SQL in Hadoop. Removes all barriers for business access to big data analytics. Unleashes millions of business-savvy SQL users on Hadoop data. Outperforms Cloudera’s Impala by up to 30x. Actian transforms Hadoop from a data lake into a high-performance analytics platform.
    30. Transform Hadoop – Transform your Business
    31. Get started today! www.actian.com/hadoop
        Pre-register for an evaluation copy of Actian’s SQL in Hadoop: bigdata.actian.com/sql-in-hadoop
        Register for a Sand Hill Hadoop Survey Results webinar on July 24, 2014: bigdata.actian.com/SandHill-Hadoop-Results
    32. Get started today! (repeats slide 31)
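The storage-cost translation on slide 8 can be checked with a few lines of integer arithmetic. A minimal sketch, using only the figures quoted on that slide (the variable names are illustrative):

```python
# Figures as quoted on slide 8 (attributed there to CIO Insight).
gigabytes = 500_000
cents_per_gb_month = 26
months_per_year = 12

# Work in whole cents to avoid floating-point rounding.
annual_cost_cents = gigabytes * cents_per_gb_month * months_per_year
annual_cost_dollars = annual_cost_cents // 100

# Volume that would produce the slide's $3.12 million at the same rate.
quoted_annual_dollars = 3_120_000
implied_gigabytes = quoted_annual_dollars * 100 // (cents_per_gb_month * months_per_year)

print(f"annual cost for {gigabytes:,} GB: ${annual_cost_dollars:,}")
print(f"volume implied by ${quoted_annual_dollars:,}: {implied_gigabytes:,} GB")
```

With the inputs as quoted, the product is $1,560,000 per year. The slide's $3.12 million corresponds to 1,000,000 gigabytes at 26 cents per gigabyte per month, so one of the quoted figures may differ from the cited source.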
