Solving performance problems on Hadoop
Moving analytic workloads into production
1
Tyler Mitchell
Sr. Software Engineer
Actian Center of Excellence
Topics
How we got (stuck) here
Performance best practices
Sample business cases
Benchmarking results
2
Actian’s Lineage
Ingres – 1970s, Pervasive – 1982, Versant – 1988, Vectorwise – 2003, ParAccel – 2006 → Actian
3
Actian at a Glance
4
10,000+ customers
400+ employees
HQ Palo Alto; 8 countries, 7 US cities
3 businesses: Data Management, Data Integration, Big Data Analytics
Industries: Banking, Insurance, Telecom and Media
How We Got (Stuck) Here
5
Accidental Hadoop Tourist – Brief History
6
[Diagram: Business and Data layers – Data Capture → Data Management & Integration → Analytics (Query & Analyze) → Solutions. Problem solved.]
Accidental Hadoop Tourist – Brief History
7
[Diagram: the same pipeline – Data Capture → Data Management & Integration → Analytics → Solutions – but the step between Analytics and Solutions is now marked "??????"]
Accidental Hadoop Tourist – Brief History
8
[Diagram: the same pipeline, now with "???" marking both the Analytics and Solutions steps]
Modern, best-in-class analytic database technology provides:
9
Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, and enable new business
The ability to make data-driven business decisions using a massively scalable platform
Decisive reduction in the cost of high performance analytics at scale
Performance that can meet all SLAs
Full leverage of existing SQL skills while deploying a modern analytic
infrastructure
Grow Revenue
Reduce Cost
Mitigate Risk
Create New Business
Business Solution Architecture Challenges
Wide Range of Use Cases
10
Financial Services – Advanced Credit Risk Analytics across billions of data points
Internet Scale Application – Predictive Analytics across hundreds of millions of customers
Media – Data Science and Discovery across trillions of IoT events
Dept of Defense – Cyber-Security: network intrusion models every second
Credit Card Processing – Fraud detection every millisecond
Performance Best Practices
11
3 Essential Big Data Concepts
12
0. Take nothing for granted
1. Partitioning vs. data skew (see the sketch after this list)
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
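
The partitioning point is easiest to see with a toy example. Below is a minimal, hypothetical Python sketch (not from the talk, and not Actian code): hash-partitioning rows on a skewed key piles most of the work onto one partition, while a near-uniform key spreads it evenly. The column names, row counts, and partition count are invented for illustration.

```python
from collections import Counter

def partition_counts(keys, num_partitions):
    """Count how many rows land in each partition when hashing on `keys`."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical rows: customer_id is skewed (one "whale" account dominates),
# while event_id is close to uniform.
rows = [("cust_001", i) for i in range(90_000)] + \
       [(f"cust_{n:05d}", 90_000 + n) for n in range(10_000)]

by_customer = partition_counts([r[0] for r in rows], 8)
by_event    = partition_counts([r[1] for r in rows], 8)

print("partition sizes by customer_id:", by_customer)  # one partition holds ~90k rows
print("partition sizes by event_id:   ", by_event)     # roughly 12.5k rows each

# A simple skew metric: largest partition relative to an ideal even split.
skew = max(by_customer) / (len(rows) / 8)
print(f"skew factor on customer_id: {skew:.1f}x the ideal share")
```

In a distributed engine, the overloaded partition becomes the straggler that bounds the whole query, which is why the choice of partition key matters as much as the number of nodes.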
6 Game Changing Database Innovations
13
6 Game Changing Database Innovations
14
1. Use the CPU! – Vector Processing (see the sketch after this list)
2. Minimize bottlenecks – Exploiting Chip Cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
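
As a rough illustration of item 1, vector processing, here is a Python/NumPy sketch that contrasts row-at-a-time interpretation with applying one operation to a whole column batch at a time. It is an analogy only, not the VectorH engine; the column names and sizes are made up.

```python
import time
import numpy as np

N = 2_000_000
price    = np.random.rand(N) * 100           # column: price
quantity = np.random.randint(1, 10, N)       # column: quantity

# Row-at-a-time: interpret one row per loop iteration (branchy, cache-unfriendly).
def revenue_row_at_a_time():
    total = 0.0
    for i in range(N):
        if price[i] > 50.0:
            total += price[i] * quantity[i]
    return total

# Vector-at-a-time: one operation over a whole column batch, so the runtime
# can use tight loops and SIMD under the hood.
def revenue_vectorized():
    mask = price > 50.0
    return float(np.sum(price[mask] * quantity[mask]))

for fn in (revenue_row_at_a_time, revenue_vectorized):
    start = time.perf_counter()
    result = fn()
    print(f"{fn.__name__}: {result:,.0f} in {time.perf_counter() - start:.2f} s")
```

Inside a database engine the same idea is applied at a finer grain: operating on small column vectors (on the order of a thousand values) keeps both the data and the operator's inner loop in CPU cache and lets the compiler emit SIMD-friendly code.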
Actian VectorH Innovations
15
Big Data Business Use Cases
16
Customer 360: Understanding Experience, Driving Revenue
17
Telecom Challenge
Vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS locations, web server, telephone, and network usage data.
Queries took minutes or hours, and sometimes never returned at all.
Critical business analysis on a consolidated customer 360 data lake was
grinding to a halt.
Deeper market insight, visualization, and the desired data management and operational optimization were all at risk.
Customer 360: Initial Architecture
18
Development System
• 300+ node cluster
• HIVE access
• SQL based BI / Data Science
• Data pre-processed because query performance was unacceptable
• Snapshot views taking days to return
Customer 360: Technical Improvements
19
Production Prototype
• 30 node cluster (10% of Hive)
• Actian Vector on Hadoop solution
• SQL based BI / Data Science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91 TB – two years of data, 1,100 columns when joined
Customer 360: Understanding Experience, Driving Revenue
20
Results
Customer 360 across prior data silos
Leveraged for customer retention strategies
Predict and take proactive, tailored responses
Enables next-gen, data-driven troubleshooting, impact analysis, and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
Financial Risk: Upgrading Legacy to Meet SLA
21
Challenge
Legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
Financial Risk: Upgrading Legacy to Meet SLA
22
Legacy System
• Single-server architecture, MS SSAS, Oracle – ~30 applications
• Pre-processing of desired measures was exploding data volumes
• Cube and analysis engines maxed out as they exceeded the 1.5 TB range
• Unable to scale to the desired rate of >200 GB/day of new data
• Impala attempt failed
• Highly invested in apps built on Analysis Services
Financial Risk: Upgrading Legacy to Meet SLA
23
New Possibilities
• Clustered solution – Hadoop, tested with 5- and 10-node clusters
• No cube pre-processing; SSAS partly kept
• Tested solutions 1 TB → 20 TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2 s or less
• Processing all data in the database: 6 s – 80 s
• 2× the nodes ≈ 200% speed improvement
Financial Risk: Upgrading Legacy to Meet SLA
24
Results
Increased data analyzed by 100X
2–200 billion rows / 1–20 TB
Report run in 28 seconds vs. 3 hours
Use of application for:
• Intra-day reporting (surveillance)
• End of day reporting (compliance)
• Overnight float investment options
• Annual CCAR Analysis
[Chart: Goal vs. Actual]
Delivering the Results With Better Engineering
25
Technical Benchmarks
26
Technical Benchmarks – Single Machine
27
Technical Benchmarks – Single Machine
28
Technical Benchmarks: VectorH - SQL on Hadoop
29
TPC-H SF1000 *
VectorH vs other platforms, faster by how much?
Tuned platforms
Identical hardware **
* Not an official TPC result
** 10 nodes, each with 2 × Intel 3.0 GHz E5-2690v2 CPUs, 256 GB RAM, 24 × 600 GB HDD, 10 Gb Ethernet, Hadoop 2.6.0
Actian VectorH Delivers More Efficient File Format
30
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes (sketched below)
• sophisticated query processing
• efficient block format, esp. 64-bit int
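
A minimal sketch of the MinMax-index idea follows. It is illustrative only and does not reflect VectorH's actual block format, sizes, or metadata layout: each block of a column carries min/max values, so a range predicate can skip blocks that cannot possibly contain matches.

```python
import random

BLOCK_SIZE = 1024  # rows per block (illustrative)

def build_blocks(values):
    """Split a column into blocks and record min/max metadata for each."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        chunk = values[i:i + BLOCK_SIZE]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks

def scan_with_minmax(blocks, lo, hi):
    """Return matching values, skipping blocks whose range can't overlap [lo, hi]."""
    scanned = 0
    out = []
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue                      # block skipped: no I/O, no decompression
        scanned += 1
        out.extend(v for v in b["rows"] if lo <= v <= hi)
    return out, scanned

# Event timestamps arrive roughly in order, so value ranges cluster per block.
column = sorted(random.randint(0, 1_000_000) for _ in range(100_000))
blocks = build_blocks(column)

matches, scanned = scan_with_minmax(blocks, 990_000, 1_000_000)
print(f"{len(matches)} rows found; scanned {scanned} of {len(blocks)} blocks")
```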
Summary
Conscientious data handling and next-gen engineering take SQL
in Hadoop to new levels.
All Hadoop users can move from development into production
while delivering compelling business results.
31
Delivering the Results With Better Engineering
32
VectorH v5 – Spark integration, external table support, and more
SIGMOD 2016 Paper
33
Thank you!
tyler.mitchell@actian.com - @1tylermitchell
Blogs at Actian.com - MakeDataUseful.com
Visit us in booth 503
34
Editor's Notes
1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This way, we can use SIMD instructions and, because of the lack of branching, make sure the CPU pipelines are not thrashed. A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
3: To feed this engine with enough data, we’re also applying the vectorized paradigm to the storage subsystem. First of all, we’re using a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512MB and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved per block, but typically more than one block of the same column is grouped. To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution. We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
6: Multi-core parallelism – the well-tuned query optimizer takes into account the query sequencing, data partitioning, and HDFS block locality to leverage the number of threads available to produce results in parallel, balancing the workload across system resources to improve throughput and response time.
All in all, the execution engine is able to do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.
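
For readers who want a feel for point 4 above, choosing a lightweight compression scheme per block, here is a hypothetical Python sketch. The encodings, thresholds, and column contents are invented and do not describe VectorH's actual algorithms.

```python
def choose_encoding(block):
    """Pick a lightweight encoding for one column block based on its characteristics."""
    distinct = len(set(block))
    runs = 1 + sum(1 for a, b in zip(block, block[1:]) if a != b)

    if runs <= len(block) // 10:
        return "run-length"          # long runs of repeated values
    if distinct <= 256:
        return "dictionary"          # few distinct values -> small codes
    return "raw"                     # leave as-is; decompression must stay cheap

# Hypothetical column blocks
status_block  = ["OK"] * 900 + ["RETRY"] * 100           # long runs
country_block = ["US", "DE", "JP", "BR"] * 250            # low cardinality
measure_block = [float(i) * 1.37 for i in range(1000)]    # high cardinality

for name, block in [("status", status_block), ("country", country_block), ("measure", measure_block)]:
    print(f"{name:8s} -> {choose_encoding(block)}")
```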