Enabling Interactive BI on Hadoop
Boaz Raufman
CTO / Co-Founder
Jethro
Interactive BI is a Unique Use-Case
Non-Interactive (Data Science, ETL, Reporting, Machine Learning)
• Managed set of queries
• Few concurrent users
Interactive (Interactive BI)
• Variety of generated queries
• Many concurrent users
Interactive BI challenges: Performance
• My query is too slow!
• Resolution:
– Data engineering
• Partitioning, Sorting, De-normalize,
Pre-aggregation, Pre-calculation, etc.
– Increase cluster size
• Cost:
– Effort time and costs $$$
– Resources $$$
• Limitations
– Data engineering can’t optimize
all queries
Interactive BI challenges: Variety
• My dashboard generates many different
queries
– Multiple dimensions, multiple measures,
complex expressions, various filters, low/high
cardinality filters, various tables relations, …
• Resolution:
– More data engineering
• Cost:
– Effort time and costs $$$
– Delay application development and
deployment $$$
• Limitations:
– Imposes limitations on the app
– Performance degradation
Manual data engineering is costly and cannot completely
resolve the variety of business needs in a timely manner
Interactive BI challenges: Concurrency
• Single dashboard interaction can
issue many queries
• I have many concurrent users
• Resolution:
– Increase cluster size
• Cost:
– Resources $$$
– Impacts other workloads on my
Hadoop cluster
Resizing resources will never catch up with
business needs
SQL-on-Hadoop Engines Are Not a Fit for Interactive BI
Pros
• General purpose
• Parallel execution
• Scalable resource utilization
• Eventually can resolve
every query via full scan
• Great for ETL, Reporting,
Machine learning, Data
Discovery
Cons
• Resource consuming
• Struggle with concurrency
• Optimizations require
manual data engineering
• Not optimized for variety
and concurrency
requirements of
interactive BI use cases
An interactive BI acceleration tool is complementary to SQL-on-Hadoop engines
Solution Requirements
• Consistent interactive response times (<10 sec)
• Efficiently handle a variety of BI queries
• Minimal resource utilization per query allowing high
concurrency
• Scalable
• Automatic – data engineering should be handled by the data
platform
In addition:
• Consistent performance upon ingestion of new data
The Realm of Queries
Select * from …
Select sum(a), sum(b)
Select sum(a), sum(b) group by c,d
Select sum(a)
Select sum(b)
Select a,b,d where e=x
Select sum(a), sum(b) where c=y group by d
Select sum(a), sum(b) where e=x group by d
We need to be optimized only for the sub-set of queries
that is relevant for Interactive BI
Jethro Adaptive Approach to Interactive BI
• Interactive BI is about visualizing data for humans
• It is composed mainly of:
– Aggregations grouped by low-cardinality dimensions
– Filters of either low or high cardinality
• To handle aggregation we use pre-aggregation (cubes)
• To handle high-cardinality filtering we use indexes
• Engine adapts to dashboard queries
– Acceleration objects are automatically generated based on user
queries
Cubes or Indexes? You need BOTH!
[Chart: performance (poor to good) against Type of Query (Detailed, Summary), with one curve for Cubes and one for Indexes]
Cubes: good for accelerating Aggregated queries
– Poor at detailed queries
Indexes: good for accelerating Granular queries
– Poor at summary queries
Jethro is unique in providing BOTH - accelerates ALL queries
Heavy Lifting is done in the Background
• Query Servers (Live Query): answer queries from Indexes and Cubes
• Builder Servers (Background): build Indexes and Cubes
• Cubes, Indexes
Performance gain ~5x-50x
Cluster resources ~0.2X
Fully Automated
(stored on Hadoop)
LIVE Demo
• Point browser at: tableau.jethrodata.com
– Login: demo / demo
• Point browser at: jethrodata.qlik.com/
– No login needed
Component    AWS HW                  Monthly Cost
Jethro       2x 120GB / 16 cores     $500 (spot)
Storage      EFS                     $200
Data:
• Based on TPC-DS benchmark
• 1TB raw data
• Fact table: ~2.9B rows
• Dimension tables: 6
AWS Servers
Customer Row_IDs
1 1,4,9
4 10
6 8
7 2
14 5
23 6,7
32 3
Row_ID Customer Item Price
1 1 … …
2 7 … …
3 32 … …
4 1 … …
5 14 … …
6 23 … …
7 23 … …
8 6 … …
9 1 … …
10 4 … …
Jethro Indexes Accelerate BI Drill Downs
• Efficient
– EVERY column can be indexed
• Effective
– The more you filter, the faster it gets
– Dataset size doesn’t impact filtered query perf
• Efficient
– Multi-level index for direct access, no need for
in-mem
Users NOT dependent on a single partition col for performance
Index Table
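The index table on this slide can be read as a map from each distinct column value to the row IDs that hold it. The following is a minimal Python sketch of that idea, using the customer values from the slide; the function names `build_index` and `lookup` are illustrative assumptions, and nothing here describes Jethro's actual on-disk, multi-level format.

```python
from collections import defaultdict

# (row_id, customer) pairs copied from the slide's example table.
# Item and Price are elided on the slide, so they are left out here.
ROWS = [(1, 1), (2, 7), (3, 32), (4, 1), (5, 14),
        (6, 23), (7, 23), (8, 6), (9, 1), (10, 4)]


def build_index(rows):
    """Map each distinct customer value to the sorted row IDs that hold it."""
    index = defaultdict(list)
    for row_id, customer in rows:
        index[customer].append(row_id)
    return {value: sorted(ids) for value, ids in index.items()}


def lookup(index, values):
    """Return row IDs matching any of the values, without scanning the rows."""
    matched = set()
    for value in values:
        matched.update(index.get(value, []))
    return sorted(matched)


customer_index = build_index(ROWS)
print(customer_index[1])                # [1, 4, 9]
print(lookup(customer_index, [23, 4]))  # [6, 7, 10]
```

A filter on a selective value touches only the matching row IDs, which is the intuition behind "the more you filter, the faster it gets".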
Auto-Cubes: How it Works
sales transactions (5B rows)
state   cust, prod, …   $sale
AL      …               $2.00
AK      …               $4.50
AZ      …               $1.00
…       …               …
WY      …               $4.25
Customer query:
select sum(sales) … where state='AZ'
Process: use index to find all rows for 'AZ'. Sum $sale for selected rows
Response: $1,643
Jethro auto gen query (move filter col into group by):
select sum(sales) … group by state
sales-by-state (50 rows)
State   $sale
AK      $256
AZ      $1,643
…       …
WY      $4,654
Subsequent queries served from auto-cube:
where state='AK'
where state in ('CA', 'NY')
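The auto-cube step can be sketched in a few lines of Python: the filter column is moved into a group-by once, and later filtered sums are read from the small aggregate instead of the fact rows. The sample rows and the names `build_cube` and `sum_sales` are hypothetical; the sketch assumes a single additive measure and is not Jethro's implementation.

```python
from collections import defaultdict

# Hypothetical miniature of the "sales transactions" table: (state, sale).
SALES = [("AZ", 1.00), ("AK", 4.50), ("AL", 2.00), ("AZ", 3.25),
         ("WY", 4.25), ("AK", 0.50), ("CA", 7.00), ("NY", 2.50)]


def build_cube(rows):
    """Run the generated query once: sum(sale) grouped by state."""
    cube = defaultdict(float)
    for state, sale in rows:
        cube[state] += sale
    return dict(cube)


def sum_sales(cube, states):
    """Answer 'where state in (...)' from the cube, not from the fact rows."""
    return sum(cube.get(state, 0.0) for state in states)


cube = build_cube(SALES)
print(sum_sales(cube, ["AZ"]))        # 4.25
print(sum_sales(cube, ["CA", "NY"]))  # 9.5
```

The cube has one row per state, so its size depends on the cardinality of the grouped column rather than on the number of transactions.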
Jethro Auto-Cubes Accelerate BI Aggregations
• Automated
– Based on actual BI queries
• Adaptive
– Automatically adjust to changes in apps and
data
• Efficient
– Dozens of small and highly efficient cubes,
matching every aggregation
– Use indexes for granular queries instead of
creating large cubes
Jethro Auto Cubes drive uninterrupted self-service BI
Jethro Query Optimization Process
1. Result-Cache
• Exact repeat of previous query
• Results were saved in storage
2. Auto Cube
• Scan existing cubes for a match
• Cubes evaluated from smallest to largest
3. Index Access
• Apply filters using indexes
• Fetch and process ONLY relevant rows and cols
Optimizer
• Rewrite query: join elimination, partition pruning, predicate push down…
• Select best execution path: cache, cubes or indexes
The BEST way to speed up a SQL query is to have it do LESS work
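The cache, cube, index order on this slide can be sketched as a simple cascade. The Python below is an illustration under stated assumptions: the query and cube dictionaries, the `choose_path` name, the subset-based cube matching rule, and the final "scan" fallback are all hypothetical and are not taken from Jethro's optimizer.

```python
def choose_path(query, result_cache, cubes, indexed_columns):
    """Pick an execution path in the order the slide lists: cache, cube, index.

    query: dict with 'sql' (str), 'group_by' (set), 'filters' (set of columns).
    cubes: list of dicts with 'name' (str), 'dims' (set), 'rows' (int size).
    """
    # 1. Result cache: only an exact repeat of a previous query qualifies.
    if query["sql"] in result_cache:
        return ("cache", None)
    # 2. Auto cube: try the smallest cube first. Assumed matching rule: the
    #    cube must contain every column the query groups by or filters on.
    needed = query["group_by"] | query["filters"]
    for cube in sorted(cubes, key=lambda c: c["rows"]):
        if needed <= cube["dims"]:
            return ("cube", cube["name"])
    # 3. Index access: apply the filters through per-column indexes.
    if query["filters"] and query["filters"] <= indexed_columns:
        return ("index", sorted(query["filters"]))
    # Fallback that is not on the slide: scan when nothing else applies.
    return ("scan", None)


cubes = [
    {"name": "by_state_product", "dims": {"state", "product"}, "rows": 5000},
    {"name": "by_state", "dims": {"state"}, "rows": 50},
]
cache = {"select sum(sales) from t"}
indexed = {"state", "customer"}

q_repeat = {"sql": "select sum(sales) from t",
            "group_by": set(), "filters": set()}
q_state = {"sql": "select sum(sales) from t where state='AZ'",
           "group_by": set(), "filters": {"state"}}
q_cust = {"sql": "select * from t where customer=23",
          "group_by": set(), "filters": {"customer"}}

print(choose_path(q_repeat, cache, cubes, indexed))  # ('cache', None)
print(choose_path(q_state, cache, cubes, indexed))   # ('cube', 'by_state')
print(choose_path(q_cust, cache, cubes, indexed))    # ('index', ['customer'])
```

Sorting cubes by size before matching means the cheapest sufficient cube wins, which mirrors the "smallest to largest" rule on the slide.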
Incremental Updates Do not Impact Performance
Original
Incremental
Indexes, Cubes, Data
Background
Incremental update of Indexes and Cubes
ETL
Watch
Folder
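One way to see why incremental updates can leave query performance unaffected: for additive measures, a new batch can be folded into existing indexes and cubes without rebuilding them. The Python below is a hedged sketch of that general technique; `apply_increment` and its data layout are assumptions made for illustration, and the slide does not describe Jethro's actual update mechanism.

```python
def apply_increment(index, cube, new_rows):
    """Fold a batch of new rows into an existing index and cube in place.

    index: dict mapping state -> list of row IDs.
    cube:  dict mapping state -> running sum of sales.
    new_rows: iterable of (row_id, state, sale) tuples from the new batch.
    Only the new rows are read; existing entries are extended, not rebuilt.
    """
    for row_id, state, sale in new_rows:
        index.setdefault(state, []).append(row_id)
        cube[state] = cube.get(state, 0.0) + sale


index = {"AZ": [1, 3], "AK": [2]}
cube = {"AZ": 4.25, "AK": 4.5}
apply_increment(index, cube, [(4, "AZ", 1.0), (5, "WY", 4.25)])
print(index)  # {'AZ': [1, 3, 4], 'AK': [2], 'WY': [5]}
print(cube)   # {'AZ': 5.25, 'AK': 4.5, 'WY': 4.25}
```

The work is proportional to the size of the batch, not to the size of the existing data. Non-additive measures such as distinct counts would need a different approach.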
Scales to 1,000’s of Users
…
• Servers are stateless, data centrally
shared
– Cubes, indexes, results shared by
servers
• Automated load balancing
– Dynamically add / drop Jethro servers
• Minimal sensitivity to cluster load
– Segregate workload by designating
specific servers to specific groups
…
Stressed and Hardened by Customers in Production
Jethro and Integration (Hive 3)
security
Queries
Sentry
Performance, Scale, Cost
• Performance – responds in seconds
– ALL BI queries, 100’s of concurrent users, BB’s of rows
• Self driving – no manual performance engineering
– Cubes and Indexes are fully automated
• Resource efficiency – reduced cluster usage
– All BI compute on Jethro nodes, significantly fewer resources
• App compatibility – “as is”
– No changes to BI apps or data model
EDW Performance at Hadoop Scale & Cost
Thank You
Backup Slides
Jethro System Diagram
Client Applications
• Commercial BI Tools
• Homegrown Viz Apps
• SQL Clients
SQL 92 via ODBC / JDBC
• AutoCubes
• Full Indexing
• Intelligent Cache
Source Data
• Hadoop (Hive, Impala,…)
• EDW
• Text Files
Jethro Acceleration Engine
Any ETL
• Cube and Index Builder
Jethro Manager
Network
Storage
Interactive BI Market Map
Non interactive
Interactive
Full-Scan Full-Scan
Manual
Cube
Auto
Cube
Auto
Index
Data
Science
Interactive
BI
Customer Insights & Profitability
• Industry: Car Rental
– Leading global car rental
– Multiple brands, 5,000+ locations,
150+ countries
– MM’s of transactions, BB’s of
marketing and sales data points
• Results:
– Performance: dashboards return in
10sec instead of 10min
– Self-Service: end-users are able to
create own analytics without IT
– Data Lake: data for all brands and
geos in one place
Leading Car Rental Company
Before: Transactions, marketing; Hortonworks HDP; Oracle Data Mart; Tableau
After: Transactions, marketing; Hortonworks HDP; Jethro Acceleration; Tableau
Physician Patient Tracking
• Industry: Health Care
– Leading data & tech provider in the
health care industry
– 500 healthcare organizations, 850K
physicians, 375K clinical facilities, more
than 230M Americans
• Results
– Scale: 1,000’s of concurrent users
– Performance: 85% of interactions
under 5sec
– Security: Access control by user; HIPAA
Leading Health Data Provider
Before: Physician / Patient Details; Hortonworks HDP; Teradata Data Mart; Tableau
After: Physician / Patient Details; Hortonworks HDP; Jethro Acceleration; Tableau
Financial operational apps over
• Industry: Banking
– Top 15 global Bank
– Operations in 35+ countries
– Personal, business, public sector and
institutional clients
• Results
– Functional: offload BI apps “as-is” from
legacy EDW to Hadoop
– $Savings: eliminate need for annual
EDW expansion
– ROI: increase usage and value of data
lake investment
Before: Many data sources; Hortonworks HDP; Vertica, other EDW Data Marts; Tableau, other BI
After: Many data sources; Hortonworks HDP; Jethro Acceleration; Tableau, other BI

More Related Content

What's hot

Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
DataWorks Summit
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache Flink
DataWorks Summit
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
DataWorks Summit
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
 
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop EnvironmentHow to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
DataWorks Summit
 
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to CoreBig Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
DataWorks Summit
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
DataWorks Summit
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
DataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
DataWorks Summit
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
DataWorks Summit
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
DataWorks Summit
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
DataWorks Summit
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
DataWorks Summit
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
Hortonworks
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, Solutions
DataWorks Summit
 

What's hot (20)

Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow ManagerBreathing New Life into Apache Oozie with Apache Ambari Workflow Manager
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache Flink
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop EnvironmentHow to Ingest 16 Billion Records Per Day into your Hadoop Environment
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
 
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to CoreBig Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, Solutions
 

Similar to Enabling real interactive BI on Hadoop

Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectTableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Remy Rosenbaum
 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
Remy Rosenbaum
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
DataWorks Summit
 
Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More Intelligent
Kyle Davis
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
Sri Ambati
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
Simon Belak
 
The New Model
The New ModelThe New Model
The New Model
David Kaiser
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
Skillwise Group
 
Fast, Powerful and Scalable Analytics
Fast, Powerful and Scalable AnalyticsFast, Powerful and Scalable Analytics
Fast, Powerful and Scalable Analytics
MariaDB plc
 
Oracle bi ee architecture
Oracle bi ee architectureOracle bi ee architecture
Oracle bi ee architecture
OBIEE Training Online
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
Altinity Ltd
 
Delivering fast, powerful and scalable analytics
Delivering fast, powerful and scalable analyticsDelivering fast, powerful and scalable analytics
Delivering fast, powerful and scalable analytics
MariaDB plc
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
Remy Rosenbaum
 
Informix & IWA : Operational analytics performance
Informix & IWA : Operational analytics performanceInformix & IWA : Operational analytics performance
Informix & IWA : Operational analytics performance
Keshav Murthy
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
 
Optimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 MinutesOptimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 Minutes
Alexandra Sasha Blumenfeld
 
HDF5 FastQuery
HDF5 FastQueryHDF5 FastQuery
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
DataWorks Summit
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Elasticsearch
 

Similar to Enabling real interactive BI on Hadoop (20)

Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live ConnectTableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect
 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More Intelligent
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
 
The New Model
The New ModelThe New Model
The New Model
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Fast, Powerful and Scalable Analytics
Fast, Powerful and Scalable AnalyticsFast, Powerful and Scalable Analytics
Fast, Powerful and Scalable Analytics
 
Oracle bi ee architecture
Oracle bi ee architectureOracle bi ee architecture
Oracle bi ee architecture
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Delivering fast, powerful and scalable analytics
Delivering fast, powerful and scalable analyticsDelivering fast, powerful and scalable analytics
Delivering fast, powerful and scalable analytics
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
Informix & IWA : Operational analytics performance
Informix & IWA : Operational analytics performanceInformix & IWA : Operational analytics performance
Informix & IWA : Operational analytics performance
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Optimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 MinutesOptimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 Minutes
 
HDF5 FastQuery
HDF5 FastQueryHDF5 FastQuery
HDF5 FastQuery
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Enabling real interactive BI on Hadoop

  • 1. Enabling Interactive BI on Hadoop Boaz Raufman CTO / Co-Founder Jethro
  • 2. Interactive BI is a Unique Use-Case • Data Science, ETL, Reporting, Machine Learning: non-interactive, managed set of queries, few concurrent users • Interactive BI: interactive, variety of generated queries, many concurrent users
  • 3. Interactive BI challenges: Performance • My query is too slow! • Resolution: – Data engineering • Partitioning, Sorting, De-normalize, Pre-aggregation, Pre-calculation, etc. – Increase cluster size • Cost: – Effort time and costs $$$ – Resources $$$ • Limitations – Data engineering can’t optimize all queries
  • 4. Interactive BI challenges: Variety • My dashboard generates many different queries – Multiple dimensions, multiple measures, complex expressions, various filters, low/high cardinality filters, various table relations, … • Resolution: – More data engineering • Cost: – Effort time and costs $$$ – Delays application development and deployment $$$ • Limitations: – Imposes limitations on the app – Performance degradation Manual data engineering is costly and cannot completely resolve the variety of business needs in a timely manner
  • 5. Interactive BI challenges: Concurrency • A single dashboard interaction can issue many queries • I have many concurrent users • Resolution: – Increase cluster size • Cost: – Resources $$$ – Impacts other workloads on my Hadoop cluster Resizing resources will never catch up with business needs
  • 6. SQL on Hadoop Engines Don’t Fit Interactive BI Pros • General purpose • Parallel execution • Scalable resource utilization • Eventually can resolve every query via full scan • Great for ETL, Reporting, Machine learning, Data Discovery Cons • Resource consuming • Struggle with concurrency • Optimizations require manual data engineering • Not optimized for variety and concurrency requirements of interactive BI use cases An interactive BI acceleration tool is complementary to SQL on Hadoop Engines
  • 7. Solution Requirements • Consistent interactive response times (<10 sec) • Efficiently handle a variety of BI queries • Minimal resource utilization per query, allowing high concurrency • Scalable • Automatic – data engineering should be handled by the data platform In addition: • Consistent performance upon ingestion of new data
  • 8. The Realm of Queries • Select * from … • Select sum(a), sum(b) • Select sum(a), sum(b) group by c,d • Select sum(a) • Select sum(b) • Select a,b,d where e=x • Select sum(a), sum(b) where c=y group by d • Select sum(a), sum(b) where e=x group by d We need to be optimized only for the subset of queries that is relevant for Interactive BI
  • 9. Jethro Adaptive Approach to Interactive BI • Interactive BI is about visualizing data for humans • It is composed mainly of: – Aggregations grouped by low cardinality dimensions – Filters of either low or high cardinality • To handle aggregation we use pre-aggregation (cubes) • To handle high cardinality filtering we use indexes • Engine adapts to dashboard queries – Acceleration objects are automatically generated based on user queries
  • 10. Cubes or Indexes? You need BOTH! [Chart: performance (good to poor) by type of query (detailed, summary), one curve for Cubes and one for Indexes] • Cubes: good for accelerating aggregated queries – Poor at detailed queries • Indexes: good for accelerating granular queries – Poor at summary queries Jethro is unique in providing BOTH - accelerates ALL queries
  • 11. Heavy Lifting is Done in the Background • Live (Query Servers): answer queries from indexes and cubes • Background (Builder Servers): build indexes and cubes • Performance gain ~5x-50x • Cluster resources ~0.2X • Fully automated (stored on Hadoop)
  • 12. LIVE Demo • Point browser at: tableau.jethrodata.com – Login: demo / demo • Point browser at: jethrodata.qlik.com/ – No login needed • AWS Servers (Component, AWS HW, Monthly Cost): Jethro, 2x 120GB / 16 cores, $500 (spot); Storage, EFS, $200 • Data: – Based on TPC-DS benchmark – 1TB raw data – Fact table: ~2.9B rows – Dimension tables: 6
  • 13. Jethro Indexes Accelerate BI Drill Downs • Efficient – EVERY column can be indexed • Effective – The more you filter, the faster it gets – Dataset size doesn’t impact filtered query perf • Efficient – Multi-level index for direct access, no need for in-mem Users NOT dependent on a single partition col for performance • Table (Row_ID, Customer, Item, Price): (1, 1, …, …); (2, 7, …, …); (3, 32, …, …); (4, 1, …, …); (5, 14, …, …); (6, 23, …, …); (7, 23, …, …); (8, 6, …, …); (9, 1, …, …); (10, 4, …, …) • Index (Customer: Row_IDs): 1: 1,4,9; 4: 10; 6: 8; 7: 2; 14: 5; 23: 6,7; 32: 3
  • 14. Auto-Cubes: How it Works • Customer query: select sum(sales) … where state=‘AZ’ • Process: use index to find all rows for ‘AZ’. Sum $sale for selected rows • Response: $1,643 • Jethro auto gen query (move filter col into group by): select sum(sales) … group by state • Subsequent queries served from auto-cube: where state=‘AK’; where state in (‘CA’, ‘NY’) • sales transactions (5B rows; columns state, cust, prod, …, $sale): AL $2.00; AK $4.50; AZ $1.00; …; WY $4.25 • sales-by-state (50 rows; columns State, $sale): AK $256; AZ $1,643; …; WY $4,654
  • 15. Jethro Auto-Cubes Accelerate BI Aggregations • Automated – Based on actual BI queries • Adaptive – Automatically adjust to changes in apps and data • Efficient – Dozens of small and highly efficient cubes, matching every aggregation – Use indexes for granular queries instead of creating large cubes Jethro Auto Cubes drive uninterrupted self-service BI [Figure: sales transactions (5B rows; state, cust, prod, …, $sale) aggregated into sales by State (50 rows; State, $sale)]
  • 16. Jethro Query Optimization Process Optimizer • Rewrite query: join elimination, partition pruning, predicate push down… • Select best execution path: cache, cubes or indexes 1. Result-Cache • Exact repeat of previous query • Results were saved in storage 2. Auto Cube • Scan existing cubes for a match • Cubes evaluated from smallest to largest 3. Index Access • Apply filters using indexes • Fetch and process ONLY relevant rows and cols The BEST way to speed up a SQL query is to have it do LESS work
  • 17. Incremental Updates Do Not Impact Performance [Diagram labels: ETL; Watch Folder; Background incremental update of Indexes and Cubes; Original and Incremental Data, Cubes, Indexes]
  • 18. Scales to 1,000’s of Users … • Servers are stateless, data centrally shared – Cubes, indexes, results shared by servers • Automated load balancing – Dynamically add / drop Jethro servers • Minimal sensitivity to cluster load – Segregate workload by designating specific servers to specific groups …
  • 19. Stressed and Hardened by Customers in Production
  • 20. Jethro and Integration (Hive 3) [Diagram labels: security, Queries, Sentry]
  • 21. Performance, Scale, Cost • Performance – responds in seconds – ALL BI queries, 100’s of concurrent users, BB’s of rows • Self driving – no manual performance engineering – Cubes and Indexes are fully automated • Resource efficiency – reduced cluster usage – All BI compute on Jethro nodes, significantly fewer resources • App compatibility – “as is” – No changes to BI apps or data model EDW Performance at Hadoop Scale & Cost
  • 24. Jethro System Diagram Client Applications • Commercial BI Tools • Homegrown Viz Apps • SQL Clients SQL 92 via ODBC / JDBC • AutoCubes • Full Indexing • Intelligent Cache Source Data • Hadoop (Hive, Impala,…) • EDW • Text Files Jethro Acceleration Engine Any ETL • Cube and Index Builder Jethro Manager Network Storage
  • 25. Interactive BI Market Map Non interactive Interactive Full-Scan Full-Scan Manual Cube Auto Cube Auto Index Data Science Interactive BI
  • 26. Customer Insights & Profitability (Leading Car Rental Company) • Industry: Car Rental – Leading global car rental – Multiple brands, 5,000+ locations, 150+ countries – MM’s of transactions, BB’s of marketing and sales data points • Results: – Performance: dashboards return in 10sec instead of 10min – Self-Service: end-users are able to create their own analytics without IT – Data Lake: data for all brands and geos in one place • Before: Transactions, marketing; Hortonworks HDP; Oracle Data Mart; Tableau • After: Transactions, marketing; Hortonworks HDP; Jethro Acceleration; Tableau
  • 27. Physician Patient Tracking (Leading Health Data Provider) • Industry: Health Care – Leading data & tech provider in the health care industry – 500 healthcare organizations, 850K physicians, 375K clinical facilities, more than 230M Americans • Results – Scale: 1,000’s of concurrent users – Performance: 85% of interactions under 5sec – Security: Access control by user; HIPAA • Before: Physician / Patient Details; Hortonworks HDP; Teradata Data Mart; Tableau • After: Physician / Patient Details; Hortonworks HDP; Jethro Acceleration; Tableau
  • 28. Financial operational apps over • Industry: Banking – Top 15 global bank – Operations in 35+ countries – Personal, business, public sector and institutional clients • Results – Functional: offload BI apps “as-is” from legacy EDW to Hadoop – $Savings: eliminate need for annual EDW expansion – ROI: increase usage and value of data lake investment • Before: Many data sources; Hortonworks HDP; Vertica, other EDW Data Marts; Tableau, other BI • After: Many data sources; Hortonworks HDP; Jethro Acceleration; Tableau, other BI
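
The mechanisms on slides 13, 14 and 16 can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Jethro's implementation: the names `build_index`, `build_cube`, `ToyEngine` and `sum_where` are hypothetical, the `SALES` rows are made-up values, and the cube is built inline on first use, whereas the deck describes cubes and indexes being built in the background. Only the customer column is taken from the slide 13 table. The sketch also skips the "smallest to largest" cube ordering, since it only ever holds one cube per column.

```python
from collections import defaultdict

# Customer column from the slide 13 table, in Row_ID order 1..10.
SLIDE_13_CUSTOMERS = [1, 7, 32, 1, 14, 23, 23, 6, 1, 4]

# Hypothetical sales rows; the values are illustrative, not from the deck.
SALES = [
    {"state": "AZ", "sales": 10.0},
    {"state": "AK", "sales": 4.5},
    {"state": "AZ", "sales": 1.0},
    {"state": "WY", "sales": 4.25},
]


def build_index(values):
    """Inverted index: column value -> 1-based Row_IDs, as on slide 13."""
    index = defaultdict(list)
    for row_id, value in enumerate(values, start=1):
        index[value].append(row_id)
    return dict(index)


def build_cube(rows, column, measure):
    """Pre-aggregate SUM(measure) GROUP BY column, as on slide 14."""
    cube = defaultdict(float)
    for row in rows:
        cube[row[column]] += row[measure]
    return dict(cube)


class ToyEngine:
    """Tries result cache, then cube, then index (the order on slide 16)."""

    def __init__(self, rows):
        self.rows = rows
        self.cache = {}
        self.cubes = {}
        self.indexes = {}

    def sum_where(self, measure, column, wanted):
        """SELECT SUM(measure) WHERE column IN wanted -> (path, total)."""
        key = (measure, column, tuple(sorted(wanted)))
        if key in self.cache:
            return "cache", self.cache[key]
        cube = self.cubes.get((column, measure))
        if cube is not None:
            path = "cube"
            total = sum(cube.get(v, 0.0) for v in wanted)
        else:
            path = "index"
            if column not in self.indexes:
                self.indexes[column] = build_index(r[column] for r in self.rows)
            index = self.indexes[column]
            row_ids = [rid for v in wanted for rid in index.get(v, [])]
            total = sum(self.rows[rid - 1][measure] for rid in row_ids)
            # Move the filter column into GROUP BY and keep the result.
            # The deck builds cubes in the background; here it is inline.
            self.cubes[(column, measure)] = build_cube(self.rows, column, measure)
        self.cache[key] = total
        return path, total


engine = ToyEngine(SALES)
print(build_index(SLIDE_13_CUSTOMERS)[1])
print(engine.sum_where("sales", "state", ["AZ"]))
print(engine.sum_where("sales", "state", ["AK", "WY"]))
print(engine.sum_where("sales", "state", ["AZ"]))
```

The first filtered query is answered through the index and leaves behind a cube grouped by the filter column; a later query on a different state is then served from that cube, and an exact repeat is served from the result cache.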