JethroData
Index-Based SQL-on-Hadoop - An Architectural Comparison of Tools
Simpler. Faster. Cheaper.
About the presenter 
Boaz Raufman – Co-Founder / CTO 
• Over 25 years experience in software design & mgmt 
• Expertise in database architecture, information 
retrieval and search technologies 
• Led numerous information retrieval projects for 
various Israeli intelligence agencies as well as for 
commercial companies 
• Started JethroData in 2010 with the idea of integrating 
database and search technologies to accelerate big 
data analytics 
• Bachelor's degree in Computer Science and Philosophy 
from the Tel-Aviv University
SQL-on-Hadoop 
Hadoop uses the same parallel design pattern as the parallel databases of the last decade
Frameworks
• MapReduce
• Tez
• Spark
Reborn on Hadoop
• Pivotal HAWQ
• IBM BigSQL
• Teradata Aster
• Actian
Newcomers
• Hive 
• Impala 
• Presto 
• Tajo 
• Drill 
• Spark SQL
Full-Scan Execution
Client: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
[Diagram: five Data Nodes, each running a Query Planner/Mgr and a Query Executor]
Performance and resources based on the size of the dataset
Shared-Nothing MPP Design Principles
Parallel Processing
• Divide the work across many nodes
• Try to minimize inter-node communication
• Work should be evenly distributed
Full Data Scanning
• Full sequential scan - massive I/O
• Data locality and local processing
• Minimize amount of data being read
– Columnar data store
– Partition by specific key
– Block stats
Performance and resource requirement based on the dataset size
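The scan-based approach above can be sketched in a few lines of Python. This is a toy model, not how any particular engine is implemented: the table `t1`, the block layout, and the helper names `block_stats` and `full_scan_sum` are all made up for illustration. It shows how block stats let the scan skip blocks that cannot contain the value, while every surviving block is still read in full.

```python
from collections import defaultdict

# Hypothetical toy table t1 as (prod, day, sales), already split into blocks.
BLOCKS = [
    [("abc", "mon", 10), ("abd", "mon", 5)],
    [("xyz", "tue", 7), ("xzz", "tue", 1)],
    [("abc", "tue", 3), ("abc", "mon", 2)],
]


def block_stats(block):
    """Return (min, max) of the prod column for one block."""
    prods = [row[0] for row in block]
    return min(prods), max(prods)


def full_scan_sum(blocks, prod):
    """SELECT day, sum(sales) FROM t1 WHERE prod=? GROUP BY day, by scanning."""
    totals = defaultdict(int)
    blocks_read = 0
    for block in blocks:
        lo, hi = block_stats(block)
        if not (lo <= prod <= hi):
            continue  # stats prove the value cannot be in this block
        blocks_read += 1
        for p, day, sales in block:  # every row of a surviving block is read
            if p == prod:
                totals[day] += sales
    return dict(totals), blocks_read


result, blocks_read = full_scan_sum(BLOCKS, "abc")
print(result, blocks_read)
```

In this sketch the work still grows with the number of blocks that survive pruning, not with the number of matching rows.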
MPP Complex queries processing
[Diagram: Local Aggregation on each node, then Global Aggregation (join, distinct, group by, order by, sub-query), then Result Merge]
Example:
SELECT DAY, COUNT(DISTINCT ITEM)
FROM T1
WHERE PRODUCT='abc'
GROUP BY DAY
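A minimal sketch of why the example query needs a global step, assuming two nodes and made-up rows (the data and the function names `local_aggregate` and `global_aggregate` are hypothetical). Each node can filter and group locally, but distinct items must be merged across nodes before counting, because the same item may appear on more than one node.

```python
from collections import defaultdict

# Hypothetical rows of T1 as (day, item, product), spread across two nodes.
NODE_ROWS = [
    [("mon", "i1", "abc"), ("mon", "i2", "abc"), ("tue", "i1", "xyz")],
    [("mon", "i1", "abc"), ("tue", "i3", "abc"), ("tue", "i3", "abc")],
]


def local_aggregate(rows, product):
    """Per-node step: filter by product and collect distinct items per day."""
    partial = defaultdict(set)
    for day, item, prod in rows:
        if prod == product:
            partial[day].add(item)
    return partial


def global_aggregate(partials):
    """Cross-node step: union the per-day item sets, then count."""
    merged = defaultdict(set)
    for partial in partials:
        for day, items in partial.items():
            merged[day] |= items
    return {day: len(items) for day, items in sorted(merged.items())}


partials = [local_aggregate(rows, "abc") for rows in NODE_ROWS]
counts = global_aggregate(partials)

# Adding up local distinct counts instead would double count item i1 on "mon".
naive = defaultdict(int)
for partial in partials:
    for day, items in partial.items():
        naive[day] += len(items)

print(counts, dict(naive))
```

The item sets shipped between the local and global steps are the inter-node traffic that the design principles try to minimize.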
Index based SQL-on-Hadoop
The Index-Access Design
Client: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
[Diagram: Jethro Query Nodes in front of five Data Nodes]
1. Index Access 2. Read data only for required rows
Performance and resources based on the size of the result-set
Index-Based Design 
• Surgical scan – minimum I/O 
• Performance and required resources based 
on the result set size 
• Extremely efficient for Interactive SQL use 
cases 
• Pay at load time
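The two steps of index access can be sketched as follows. This is an illustrative toy, assuming an in-memory table and a plain dictionary as the index; `build_index` and `index_sum` are hypothetical names and do not reflect Jethro's actual storage format.

```python
# Hypothetical toy table t1 as (prod, day, sales); the row id is the list position.
ROWS = [
    ("abc", "mon", 10),
    ("xyz", "mon", 5),
    ("abc", "tue", 3),
    ("xyz", "tue", 7),
    ("abc", "mon", 2),
]


def build_index(rows, column):
    """Map each value of one column to the list of row ids holding it."""
    index = {}
    for row_id, row in enumerate(rows):
        index.setdefault(row[column], []).append(row_id)
    return index


# "Pay at load time": the index is built once, when the data is loaded.
PROD_INDEX = build_index(ROWS, 0)


def index_sum(rows, prod_index, prod):
    """SELECT day, sum(sales) FROM t1 WHERE prod=? GROUP BY day, via the index."""
    totals = {}
    rows_read = 0
    for row_id in prod_index.get(prod, []):  # 1. index access
        _, day, sales = rows[row_id]         # 2. read only the required rows
        rows_read += 1
        totals[day] = totals.get(day, 0) + sales
    return totals, rows_read


result, rows_read = index_sum(ROWS, PROD_INDEX, "abc")
print(result, rows_read)
```

Here the number of rows touched equals the number of matching rows, regardless of how large the table is.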
Architecture – Contrarian Concepts 
• Index everything 
– Every column is indexed 
• Column oriented
– Columnar or row-groups 
– Append only data model 
• Everything is stored in HDFS 
– Can also work with S3 or Posix 
• Shared Everything 
– Separate compute and storage, 
each scales-out independently 
– Minimize cross-node operations 
– Stateless Query nodes 
• Parallelized multi-threaded 
execution 
– multiple parallelization dimensions: 
columns, row ranges, partitions, 
pipelining and bucketing 
[Diagram: Client; Processing Layer: Jethro Nodes; Storage Layer: HDFS/Posix/S3]
Jethro Indexes
• Indexes map each column value to a list of rows
Value | Rows
FR    | 5,9,10,11,14
IL    | 1,3,7,12,13
US    | 2,4,6,8,15
• Jethro stores indexes as hierarchical compressed bitmaps
• Very fast query operations – AND / OR / NOT
• Processes the entire WHERE clause into a final list of rows
• Patent pending:
http://www.google.com/patents/WO2013001535A3?cl=en
• INSERT Performance
– Load is very fast: files are appended, no random read/write, no locks
– Jethro Indexes are append-only. If needed, duplicate entries are allowed
– Periodic background merge (non-blocking)
– Compatible with HDFS
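To make the AND / OR / NOT point concrete, here is a small sketch that uses Python integers as plain, uncompressed bitmaps. It reuses the country values from the slide; the `YEAR` index and the helper names are invented for the example, and the hierarchical compression that Jethro describes is not modeled.

```python
def to_bitmap(rows):
    """Encode a collection of row ids as an integer with one bit per row."""
    bitmap = 0
    for r in rows:
        bitmap |= 1 << r
    return bitmap


def to_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Country index taken from the slide's example.
COUNTRY = {
    "FR": to_bitmap([5, 9, 10, 11, 14]),
    "IL": to_bitmap([1, 3, 7, 12, 13]),
    "US": to_bitmap([2, 4, 6, 8, 15]),
}
ALL_ROWS = to_bitmap(range(1, 16))

# Hypothetical second column index, added only for this sketch.
YEAR = {2013: to_bitmap(range(1, 9)), 2014: to_bitmap(range(9, 16))}

# WHERE country IN ('FR', 'IL') AND year = 2014
fr_or_il_2014 = to_rows((COUNTRY["FR"] | COUNTRY["IL"]) & YEAR[2014])

# WHERE NOT country = 'US'
not_us = to_rows(ALL_ROWS & ~COUNTRY["US"])

print(fr_or_il_2014)
print(not_us)
```

Each predicate combination is a single bitwise operation over whole bitmaps, so the WHERE clause resolves to a row list before any column data is read.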
Built in optimizations 
• Code is written in C++ 
• Column store and true column processing 
• Vectorization for expression evaluation 
• Multi-threaded and parallelized execution 
• Planner uses index metadata - index-based queries
• Server-side cache in memory and local disk
Use-Case Analysis 
Full-Scan: Performance depends on size of dataset 
Index-Access: Performance depends on size of result-set
Comparing Recent Benchmarks – Jun 2014
• Impala/Parquet vs. Hive/Tez, Presto, Shark (Source)
• Jethro vs. Impala/Parquet (Source)
Using the same queries as the Jun 2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)
Benchmark – Jethro vs. Impala – Oct 2014
TPC-DS interactive queries, Oct 2014
[Bar chart: Impala 1.4.2 vs. Jethro 0.9 on TPC-DS queries q19, q42, q52, q55, q63, q68, q73, q98; y-axis 0 to 200. Data labels in extraction order: 103, 39.8, 39.4, 39.9, 73.4, 188.4, 84.3, 85.2 and 6.4, 5, 4.9, 4, 12.3, 11.7, 10.3, 4.4]
* Queries use original TPC-DS filter criteria
DEMO 
Go to Tableau demo 
1. Point browser at: http://54.245.114.83/ 
2. Login as try-jethro/jethro123 
3. Edit workbooks: 
1. Jethro: Jethro sd – save 
2. Impala: Impala sd save
Side-By-Side Implementation
[Diagram labels: Data Stream, Jethro Indexer, Jethro Query Nodes, MapReduce / Impala, Hive / Pig, BI Tools, SQL]
• Existing Hadoop tables are untouched
• Indexes are added to select tables. ~30% incremental storage
1. Installing JethroData 
• Existing Hadoop cluster 
– CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x 
• Designated Jethro server 
– Can be inside or outside the cluster 
– HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD 
for cache 
• Install Jethro – download package 
– rpm install 
– Install HDFS client (if needed) 
– Create /jethro dir in HDFS 
• Start Jethro 
– service jethro start
2. Load Data into Jethro 
• Run create instance script 
– JethroAdmin create-instance demo /Jethro/demo 
• Define a new table 
– JethroClient demo localhost 9111 
• Create table sales_demo (…); 
• Run JethroLoader process 
– JethroLoader demo sales_demo.desc 
sales_demo.csv & 
• Start Querying – ODBC, JDBC, JethroClient 
That’s it!
Road Map 
• Jethro S3 
• Analytic functions 
• UDF 
• Cascading optimizer 
• Function indexes 
• Lightweight text search
• Row-group format (Parquet/ORC)
• Integration with YARN for resource management
• Sync with Hive Metastore/HCatalog
• Nested data
• Materialized views
• Distributed query
Functional Indexes 
Problem 
How to accelerate this query: 
Select count(*) from T where year(birthdate)=2007; 
Solution 
• Function index created for commonly 
used functions 
• Some function indexes are automatically 
created for specific data types. Example: 
year function for timestamp 
• Query optimizer will identify scenarios 
where functional indexes should be used 
• Function index can also be user defined 
or created on the fly via adaptive 
optimization 
Base Index                   | Function YEAR index
Value      | Rows            | Value | Rows
02/04/2007 | 5,9,10,11,14    | 2007  | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
03/05/2007 | 1,3,7,12,13     |       |
10/10/2007 | 2,4,6,8,15      |       |
01/02/2008 | 15,18           | 2008  | 15,16,17,18
05/03/2008 | 16,17           |       |
Query uses function index examples:
year(c1)=2007 -> explicit use of the year index
C1 between 2007-01-01 and 2007-12-31 -> implicit use of the year index
C1 between 2007-01-01 and 2008-02-15 -> mix: take the year index for 2007 and the rest from the base index
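The mixed case can be sketched as below. The data is hypothetical (it does not reproduce the slide's table), the indexes are plain dictionaries of row-id sets, and `build_year_index` and `rows_between` are illustrative names. The sketch takes whole years from the year index and falls back to the base index only for the partially covered year.

```python
from datetime import date

# Hypothetical base index on column c1: value -> set of row ids.
BASE_INDEX = {
    date(2007, 4, 2): {5, 9},
    date(2007, 10, 10): {2, 4},
    date(2008, 2, 1): {15, 18},
    date(2008, 3, 5): {16, 17},
}


def build_year_index(base_index):
    """Derive the year(c1) function index from the base index."""
    year_index = {}
    for value, rows in base_index.items():
        year_index.setdefault(value.year, set()).update(rows)
    return year_index


YEAR_INDEX = build_year_index(BASE_INDEX)


def rows_between(start, end):
    """Rows with start <= c1 <= end, preferring whole-year index entries."""
    rows = set()
    lookups = []
    for year in range(start.year, end.year + 1):
        year_start, year_end = date(year, 1, 1), date(year, 12, 31)
        if start <= year_start and year_end <= end:
            # The range covers the whole year: one lookup in the year index.
            rows |= YEAR_INDEX.get(year, set())
            lookups.append(("year", year))
        else:
            # Partial year: fall back to individual base index entries.
            lo, hi = max(start, year_start), min(end, year_end)
            for value, value_rows in BASE_INDEX.items():
                if lo <= value <= hi:
                    rows |= value_rows
                    lookups.append(("base", value))
    return rows, lookups


rows, lookups = rows_between(date(2007, 1, 1), date(2008, 2, 15))
print(sorted(rows), lookups)
```

The base-index fallback here walks every entry for simplicity; a real index would presumably keep values sorted so that a range lookup touches only the relevant entries.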
Jethro’s Benefits 
Simple to use 
• Implemented side by 
side with existing 
Hadoop system 
• Access via SQL or 
your favorite BI tool 
• Integrates with
Hadoop ecosystem
10X Faster queries
• Interactive analysis
with sub-second
latency
• Access to data as it 
arrives 
• Analyze granular, 
raw data 
50% Cheaper to operate 
• Significantly less 
computing resources 
• No dual systems, costly 
ETL 
• Elastically scalable on 
commodity hardware
Try it Today 
• Point browser at: 
http://www.jethrodata.com/home 
• Click 
• Register:
Jethro – Big Data Analytics. Real-Time. 
Thank You!

Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 

Jethro data meetup index base sql on hadoop - oct-2014

  • 1. JethroData Indexed Based SQL-on-Hadoop - An Architectural Comparison of Tools Simpler. Faster. Cheaper.
  • 2. About the presenter Boaz Raufman – Co-Founder / CTO • Over 25 years of experience in software design & mgmt • Expertise in database architecture, information retrieval and search technologies • Led numerous information retrieval projects for various Israeli intelligence agencies as well as for commercial companies • Started JethroData in 2010 with the idea of integrating database and search technologies to accelerate big data analytics • Bachelor's degree in Computer Science and Philosophy from Tel Aviv University
  • 3. SQL-on-Hadoop Hadoop uses the same parallel design pattern as the parallel databases of the last decade Frameworks • MapReduce • Tez • Spark Reborn on Hadoop • Pivotal HAWQ • IBM BigSQL • Teradata Aster • Actian Newcomers • Hive • Impala • Presto • Tajo • Drill • Spark SQL
  • 4. Full-Scan Execution Client: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day [Diagram: five Data Nodes, five Query Executors, five Query Planner/Mgr components] Performance and resources based on the size of the dataset
  • 5. Shared-Nothing MPP Design Principles Parallel Processing • Divide the work across many nodes • Try to minimize inter-node communication • Work should be evenly distributed Full Data Scanning • Full sequential scan - massive I/O • Data locality and local processing • Minimize amount of data being read – Columnar data store – Partition by specific key – Block stats Performance and resource requirement based on the dataset size
  • 6. MPP Complex queries processing Result Merge Global Aggregation (join, distinct, group by, order by, sub-query) Local Aggregation Example: SELECT DAY, COUNT(DISTINCT ITEM) FROM T1 WHERE PRODUCT='abc' GROUP BY DAY
  • 8. The Index-Access Design Client: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day [Diagram: five Data Nodes and Jethro Query Nodes] 1. Index access 2. Read data only for required rows Performance and resources based on the size of the result set
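The local/global aggregation split on slide 6 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the in-memory row tuples and the two sample "nodes" are hypothetical and not taken from any of the engines named in the talk. Each node filters and collects distinct items per day locally; a final step merges the partial sets and counts them, which is why COUNT(DISTINCT ...) forces data to move between nodes.

```python
from collections import defaultdict

def local_aggregate(rows, product):
    """Per-node step: collect the distinct items seen per day for one product."""
    partial = defaultdict(set)
    for day, item, prod in rows:
        if prod == product:
            partial[day].add(item)
    return partial

def global_aggregate(partials):
    """Merge step: union the per-node sets, then count distinct items per day."""
    merged = defaultdict(set)
    for partial in partials:
        for day, items in partial.items():
            merged[day] |= items
    return {day: len(items) for day, items in sorted(merged.items())}

# Hypothetical rows of (day, item, product) held by two nodes.
node_a = [("mon", "i1", "abc"), ("mon", "i2", "abc"), ("tue", "i1", "xyz")]
node_b = [("mon", "i2", "abc"), ("tue", "i3", "abc")]

result = global_aggregate([local_aggregate(n, "abc") for n in (node_a, node_b)])
print(result)
```

Item i2 appears on both nodes for the same day, so the partial results cannot simply be added; the sets themselves have to be shipped to the merge step.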
  • 9. Index-Based Design • Surgical scan – minimum I/O • Performance and required resources based on the result set size • Extremely efficient for Interactive SQL use cases • Pay at load time
  • 10. Architecture – Contrarian Concepts • Index everything – Every column is indexed • Column-oriented – Columnar or row-groups – Append-only data model • Everything is stored in HDFS – Can also work with S3 or Posix • Shared Everything – Separate compute and storage, each scales out independently – Minimize cross-node operations – Stateless query nodes • Parallelized multi-threaded execution – multiple parallelization dimensions: columns, row ranges, partitions, pipelining and bucketing [Diagram: Client; Processing Layer: Jethro Nodes; Storage Layer: HDFS/Posix/S3]
  • 11. Jethro Indexes • Indexes map each column value to a list of rows (Value → Rows: FR → rows 5,9,10,11,14; IL → rows 1,3,7,12,13; US → rows 2,4,6,8,15) • Jethro stores indexes as hierarchical compressed bitmaps • Very fast query operations – AND / OR / NOT • Processes the entire WHERE clause to a final list of rows • Patent pending: http://www.google.com/patents/WO2013001535A3?cl=en • INSERT Performance – Load is very fast: files are appended, no random read/write, no locks – Jethro indexes are append-only. If needed, duplicate entries are allowed – Periodic background merge (non-blocking) – Compatible with HDFS
  • 12. Built-in optimizations • Code is written in C++ • Column store and true column processing • Vectorization for expression evaluation • Multi-threaded and parallelized execution • Planner using index metadata - index-based queries • Server-side cache in memory and local disk
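To make the AND / OR / NOT point on slide 11 concrete, here is a minimal sketch that uses plain Python integers as uncompressed bitmaps. It is not Jethro's hierarchical compressed format, which the slides do not describe in detail. The country column reproduces the FR/IL/US rows from the slide; the `channel` column and the helper names are hypothetical.

```python
def build_index(values):
    """Map each distinct value to a bitmap (Python int) of 1-based row numbers."""
    index = {}
    for row, value in enumerate(values, start=1):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [row for row in range(1, bitmap.bit_length()) if (bitmap >> row) & 1]

# Country column laid out so the index matches the slide's FR / IL / US rows.
country = ["IL", "US", "IL", "US", "FR", "US", "IL", "US",
           "FR", "FR", "FR", "IL", "IL", "FR", "US"]
# Hypothetical second column: odd rows are "web", even rows are "store".
channel = ["web", "store"] * 7 + ["web"]

idx_country = build_index(country)
idx_channel = build_index(channel)
all_rows = (1 << (len(country) + 1)) - 2  # bits 1..15 set

# WHERE (country = 'FR' OR country = 'IL') AND NOT channel = 'web'
matching = (idx_country["FR"] | idx_country["IL"]) & (all_rows & ~idx_channel["web"])
print(rows_of(matching))
```

The whole WHERE clause reduces to a handful of bitwise operations, and only the rows left in `matching` would then need to be read from the column data.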
  • 13. Use-Case Analysis Full-Scan: Performance depends on size of dataset Index-Access: Performance depends on size of result-set
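The contrast on slide 13 can be shown by counting how many rows each approach touches for the query used earlier in the talk (SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day). The table contents, the 1% selectivity and the function names below are assumptions chosen for illustration; a real engine works on columnar blocks rather than Python tuples.

```python
from collections import defaultdict

# Hypothetical table of (day, prod, sales); 1 row in 100 has prod = 'abc'.
table = [(f"day{r % 3}", "abc" if r % 100 == 0 else "other", 10)
         for r in range(1000)]

# Index built once at load time: prod value -> list of row ids.
prod_index = defaultdict(list)
for row_id, (_, prod, _) in enumerate(table):
    prod_index[prod].append(row_id)

def full_scan(prod):
    """Read every row and filter; cost follows the dataset size."""
    totals, touched = defaultdict(int), 0
    for day, p, sales in table:
        touched += 1
        if p == prod:
            totals[day] += sales
    return dict(totals), touched

def index_access(prod):
    """Look up matching row ids first; cost follows the result-set size."""
    totals, touched = defaultdict(int), 0
    for row_id in prod_index.get(prod, []):
        day, _, sales = table[row_id]
        touched += 1
        totals[day] += sales
    return dict(totals), touched

scan_totals, scan_touched = full_scan("abc")
index_totals, index_touched = index_access("abc")
print(scan_touched, index_touched, scan_totals == index_totals)
```

Both paths return the same totals; the difference is that the index path pays its cost when the index is built, which matches the "pay at load time" note on slide 9.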
  • 14. Comparing Recent Benchmarks – Jun 2014 • Impala/Parquet vs. Hive/Tez, Presto, Shark • Jethro vs. Impala/Parquet • Using the same queries as in the Jun-2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)
  • 15. Benchmark – Jethro vs. Impala – Oct 2014 103 TPC-DS Interactive queries [Bar chart: queries q19, q42, q52, q55, q63, q68, q73, q98; series Impala 1.4.2 and Jethro 0.9; axis range 0 to 200; data labels in extraction order: 39.8 39.4 39.9 73.4 188.4 84.3 85.2 6.4 5 4.9 4 12.3 11.7 10.3 4.4] * Queries use original TPC-DS filter criteria
  • 16. DEMO Go to Tableau demo 1. Point browser at: http://54.245.114.83/ 2. Login as try-jethro/jethro123 3. Edit workbooks: 1. Jethro: Jethro sd – save 2. Impala: Impala sd save
  • 17. Side-By-Side Implementation [Diagram components: Data Stream, Jethro Indexer, Jethro Query Nodes, MapReduce / Impala, Hive / Pig, BI Tools (SQL)] • Existing Hadoop tables are untouched • Indexes are added to select tables. ~30% incremental storage
  • 18. 1. Installing JethroData • Existing Hadoop cluster – CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x • Designated Jethro server – Can be inside or outside the cluster – HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD for cache • Install Jethro – download package – rpm install – Install HDFS client (if needed) – Create /jethro dir in HDFS • Start Jethro – service jethro start
  • 19. 2. Load Data into Jethro • Run create instance script – JethroAdmin create-instance demo /Jethro/demo • Define a new table – JethroClient demo localhost 9111 • Create table sales_demo (…); • Run JethroLoader process – JethroLoader demo sales_demo.desc sales_demo.csv & • Start Querying – ODBC, JDBC, JethroClient That’s it!
  • 20. Road Map • Jethro S3 • Analytic functions • UDF • Cascading optimizer • Function indexes • Lightweight text search • Row-group format (Parquet/ORC) • Integration with YARN for resource management • Sync with Hive Metastore/HCatalog • Nested data • Materialized views • Distributed query
  • 21. Functional Indexes Problem: how to accelerate this query: SELECT count(*) FROM T WHERE year(birthdate)=2007; Solution • Function indexes are created for commonly used functions • Some function indexes are created automatically for specific data types. Example: year function for timestamp • The query optimizer identifies scenarios where function indexes should be used • A function index can also be user-defined or created on the fly via adaptive optimization Base index (Value → Rows): 02/04/2007 → 5,9,10,11,14; 03/05/2007 → 1,3,7,12,13; 10/10/2007 → 2,4,6,8,15; 01/02/2008 → 15,18; 05/03/2008 → 16,17 Function YEAR index (Value → Rows): 2007 → 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15; 2008 → 15,16,17,18 Examples of queries that use the function index: year(c1)=2007 → explicit use of the year index; c1 between 2007-01-01 and 2007-12-31 → implicit use of the year index; c1 between 2007-01-01 and 2008-02-15 → mix: take the year index for 2007 and the rest from the base index
  • 22. Jethro’s Benefits Simple to use • Implemented side by side with existing Hadoop system • Access via SQL or your favorite BI tool • Integrates with the Hadoop ecosystem 10X Faster queries • Interactive analysis with sub-second latency • Access to data as it arrives • Analyze granular, raw data 50% Cheaper to operate • Significantly less computing resources • No dual systems, costly ETL • Elastically scalable on commodity hardware
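The function-index idea on slide 21 can be sketched by deriving a year index from the base index and then answering the "mix" range query from both. The base index values are the ones shown on the slide. Everything else is an assumption made for illustration: Python sets stand in for bitmaps, the dates are read as dd/mm/yyyy (the slide does not state the format), and the function names are hypothetical rather than Jethro's API.

```python
from datetime import date, datetime

# Base index from the slide: column value -> rows holding that value.
base_index = {
    "02/04/2007": {5, 9, 10, 11, 14},
    "03/05/2007": {1, 3, 7, 12, 13},
    "10/10/2007": {2, 4, 6, 8, 15},
    "01/02/2008": {15, 18},
    "05/03/2008": {16, 17},
}

def parse(value):
    """Assumed date format: dd/mm/yyyy."""
    return datetime.strptime(value, "%d/%m/%Y").date()

def build_year_index(base):
    """Function index: year(value) -> union of the rows of all matching values."""
    year_index = {}
    for value, rows in base.items():
        year_index.setdefault(parse(value).year, set()).update(rows)
    return year_index

def rows_between(base, year_index, start, end):
    """Use the year index for fully covered years, the base index for the rest."""
    rows = set()
    for year in range(start.year, end.year + 1):
        if start <= date(year, 1, 1) and date(year, 12, 31) <= end:
            rows |= year_index.get(year, set())
        else:
            for value, value_rows in base.items():
                d = parse(value)
                if d.year == year and start <= d <= end:
                    rows |= value_rows
    return rows

year_index = build_year_index(base_index)
mixed = rows_between(base_index, year_index, date(2007, 1, 1), date(2008, 2, 15))
print(sorted(mixed))
```

For the range ending 2008-02-15, the year 2007 is covered completely and is served by a single year-index entry, while 2008 is only partly covered and falls back to the individual base-index entries.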
  • 23. Try it Today • Point browser at: http://www.jethrodata.com/home • Click • Register:
  • 24. Jethro – Big Data Analytics. Real-Time. Thank You!