Transcript of "December 2013 HUG: InfiniDB for Hadoop"

  1. Bay Area Hadoop Users Group. Turning the Tables with InfiniDB for Hadoop. December 18, 2013
  2. Agenda
     • InfiniDB Background
     • InfiniDB Technical Foundations
       o Parallelism
       o Partitioning Model
       o Additional I/O Efficiencies
     • (My)SQL for Hadoop
     • When to use Columnar/InfiniDB for Hadoop
     • InfiniDB Benchmarks
     Copyright © 2013 Calpont. All Rights Reserved.
  3. InfiniDB Background
     Platforms
     • InfiniDB
     • InfiniDB for the Cloud
     • InfiniDB for Hadoop
     Versions
     • InfiniDB launched Feb 2010
     • InfiniDB 4: latest release, available October 2013
     • Added InfiniDB for Hadoop
     • Source code at https://github.com/infinidb
     • GPL v2
     • No restrictions on syntax, scale, or performance
  4. InfiniDB Background - Customer Base
  5. InfiniDB Background: Platforms
     • InfiniDB: Local Disk, GlusterFS, Windows*
       o http://www.calpont.com/products/tryinfinidb
     • InfiniDB for Hadoop: CDH or HDP
       o http://www.calpont.com/products/tryinfinidb
     • InfiniDB for the Cloud: Any availability zone
  6. InfiniDB Background - InfiniDB for Hadoop
     • InfiniDB is a non-map/reduce engine
     • Reads and writes natively to HDFS
     [Diagram labels: Pig/Hive; HBase; Map Reduce; InfiniDB for Hadoop; Hadoop Distributed File System]
  7. InfiniDB Background - InfiniDB for Hadoop
     • Is InfiniDB a Database? …not a General Purpose DBMS.
     • Is InfiniDB NoSQL? … only in the sense that we discarded traditional DBMS architectures.
     • Is InfiniDB an SQL for Hadoop technology? … Yes, but not general purpose SQL. InfiniDB is highly optimized for analytic workloads/queries.
     “InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn’t do with Hadoop, Mongo, or Teradata”
  8. InfiniDB Foundation - Parallelism
     • User Module: processes SQL requests
     • Performance Module: executes the queries
     [Diagram labels: Single Server or MPP; Local disk / EBS; GlusterFS / HDFS]
  9. InfiniDB Foundation - Parallelism
     • Purpose-built C++ engine
     • Parallelism is at the thread level
     • Example: 12 Performance Module (PM) servers with 8 cores each yield 96 parallel processing engines.
     • SQL is translated into thousands or tens of thousands of discrete jobs or “primitives”.
     • The User Module (UM) sends primitives to the processing engines.
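
The arithmetic and dispatch pattern on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not InfiniDB's C++ implementation: `scan_primitive`, `run_query`, and the sample blocks are hypothetical names and data, and a Python thread pool only shows the shape of the fan-out, not real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor

PM_SERVERS = 12      # from the slide's example
CORES_PER_PM = 8     # from the slide's example
ENGINES = PM_SERVERS * CORES_PER_PM  # 96 parallel processing engines


def scan_primitive(block, threshold):
    """Hypothetical primitive: count values in one block above a threshold."""
    return sum(1 for value in block if value > threshold)


def run_query(blocks, threshold, engines=ENGINES):
    """Fan one primitive per block out to a pool, then combine the results."""
    with ThreadPoolExecutor(max_workers=engines) as pool:
        partial_counts = pool.map(lambda b: scan_primitive(b, threshold), blocks)
        return sum(partial_counts)


# Hypothetical data: 1,000 blocks of 100 integers each (values 0..99,999).
blocks = [list(range(i, i + 100)) for i in range(0, 100_000, 100)]
total = run_query(blocks, threshold=49_999)
```

One query becomes 1,000 small jobs here, which is the same idea as the slide's "thousands of primitives": many small units of work keep every engine busy regardless of how the data is laid out.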
  10. InfiniDB Foundation - Parallelism
     • User Module: processes SQL requests
     • Performance Module: executes the queries
     • Primitives are issued to a thread queue within the PM
     • Fixed thread count at the PM
     [Diagram labels: Single Server; MPP; Local disk / EBS; GlusterFS / HDFS]
  11. Fully Parallel SQL + Full SQL Syntax
     SQL operations are translated into thousands of jobs via custom Distribution of Work (DoW):
     • Parallel/Distributed Data Access
     • Parallel/Distributed Joins (Inner, Outer)
     • Parallel/Distributed Sub-queries (From, Where, Select)
     • Parallel/Distributed Group By, Distinct, and Aggregation
     • Extensible with Parallel/Distributed User Defined Functions
     Results are returned to the User Module in the Reduce phase.
     [Diagram labels: DoW; Reduce]
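
A distributed Group By with a final Reduce phase can be sketched as partial aggregation followed by a merge. The sketch below assumes each Performance Module aggregates its own rows and the User Module merges the partial results. The function names and the sample rows are hypothetical, and the slides do not describe the actual wire protocol or data structures.

```python
from collections import Counter


def pm_partial_group_by(rows):
    """Hypothetical PM-side step: aggregate this module's rows locally."""
    partial = Counter()
    for key, amount in rows:
        partial[key] += amount
    return partial


def um_reduce(partials):
    """Hypothetical UM-side step: merge the partial aggregates."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical data: (publisher, sales) rows spread over three PMs.
pm_rows = [
    [("acme", 10), ("zenith", 5)],
    [("acme", 7)],
    [("zenith", 1), ("orbit", 4)],
]
result = um_reduce(pm_partial_group_by(rows) for rows in pm_rows)
```

Because SUM is associative, the merged result equals what a single-node GROUP BY would produce, while only one small partial table per module crosses the network.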
  12. InfiniDB Data Partitioning
     2-Dimensional Partitioning Model
     • Vertical Partitioning by Column
       o Not Column-Family (no relation to HBase)
       o Only do I/O for columns requested
     • Horizontal Partitioning by range of rows
       o Meta-data stored within in-memory structure
  13. InfiniDB Data Partitioning
     Partition elimination can occur based on:
     • Columns not included in the SQL statement.
     • A filter expressed within the query.
     • A filter expressed on a joined table: a Table1 filter can drive Table2 I/O elimination.
     • The intersection between filters: Filter1 AND Filter2 does I/O only on the intersection.
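
Filter-based elimination and the intersection rule can be illustrated with a small sketch. It assumes that the in-memory metadata records a minimum and maximum value per column for each range of rows (an extent). The slides only say that metadata is held in memory, so the min/max layout, the `Extent` class, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical metadata for one horizontal partition of one column."""
    extent_id: int
    min_value: int
    max_value: int


def candidate_extents(extents, low, high):
    """Return ids of extents whose [min, max] range overlaps [low, high]."""
    return {e.extent_id for e in extents
            if e.max_value >= low and e.min_value <= high}


# Hypothetical metadata for two columns of the same table. Extent ids line up
# across columns because each id covers the same range of rows.
order_date = [Extent(0, 20130101, 20130331), Extent(1, 20130401, 20130630),
              Extent(2, 20130701, 20130930), Extent(3, 20131001, 20131231)]
store_id = [Extent(0, 1, 50), Extent(1, 40, 90),
            Extent(2, 10, 30), Extent(3, 60, 99)]

by_date = candidate_extents(order_date, 20130501, 20130815)   # Filter1
by_store = candidate_extents(store_id, 45, 70)                # Filter2
to_read = by_date & by_store                                  # Filter1 AND Filter2
```

In this example Filter1 alone leaves two extents and Filter2 alone leaves three, but only one extent can satisfy both, so only that one needs to be read from disk.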
  14. Column Restriction and Projection
     [Diagram labels: Column # Four; Column # Six; Column # Seventeen; Filter 1; Filter 2; Filter 3; Extent # 5; Extent # 27; Projection]
     • Automatic Vertical Partitioning + Horizontal Partitioning
     • Just-In-Time Materialization
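
Just-in-time materialization can be sketched as follows: filters run against individual columns and produce row positions, and the projected columns are read only for the positions that survive. This is a simplified model under stated assumptions; the in-memory `columns` dictionary, `read_column`, and `select` are hypothetical and stand in for on-disk column files.

```python
# Hypothetical column-oriented table: one list per column, aligned by row position.
columns = {
    "id": [1, 2, 3, 4, 5, 6],
    "platform": ["pc", "console", "pc", "mobile", "pc", "console"],
    "year": [2009, 2011, 2012, 2012, 2013, 2013],
    "name": ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"],
}
columns_read = []  # records which columns were touched, standing in for I/O


def read_column(name):
    """Return one column's values and note that it was read."""
    columns_read.append(name)
    return columns[name]


def select(projection, filters):
    """Apply filters column by column, then materialize only surviving rows."""
    positions = range(len(next(iter(columns.values()))))
    for column_name, predicate in filters:
        values = read_column(column_name)
        positions = [p for p in positions if predicate(values[p])]
    projected = {name: read_column(name) for name in projection}
    return [tuple(projected[name][p] for name in projection) for p in positions]


rows = select(
    projection=["name"],
    filters=[("platform", lambda v: v == "pc"), ("year", lambda v: v >= 2012)],
)
```

The `id` column is never touched, and `name` values are assembled into output rows only after both filters have narrowed the position list.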
  15. Additional I/O Efficiency
     Techniques to Avoid Unnecessary I/O
     • Vertical Partitioning: read only the columns required
     • Horizontal Partition: focus on the rows required
     • Just-in-time materialization
     Techniques for Efficient I/O
     • Columnar compression reduces I/O from disk
     • Global data buffer cache can reduce disk I/O (in-memory)
     • Avoidance of Random I/O
  16. InfiniDB Design Principles: Scalable, Fast, Simple
  17. (My)SQL for Hadoop - Engine=InfiniDB
     InfiniDB uses standard “Engine=InfiniDB” syntax:

         CREATE TABLE `game_warehouse`.`dim_title` (
           `id` INT,
           `name` VARCHAR(45),
           `publisher` VARCHAR(45),
           `release_date` DATE,
           `language` INT,
           `platform_name` VARCHAR(45),
           `version` VARCHAR(45)
         ) ENGINE=InfiniDB;
  18. (My)SQL for Hadoop
     • Leverage existing tools that connect to MySQL
     • Expose Structured Data to the Business
     • Familiar User Privilege Administration
     [Logos: MicroStrategy; JasperSoft; Pentaho]
     MySQL ease of use + Hadoop Scale + Columnar Performance
  19. Syntax Support
     • Broad MySQL SQL syntax
     • Analytic/windowing functions included with InfiniDB 4
     • No indexing needed. Partitioning is automatic.
     [Diagram: InfiniDB Supported Syntax, with - and + markers]
  20. When to Use InfiniDB for Hadoop
     Query Size (Vision/Scope) defines workloads:
     [Axis: Query Size/Vision/Scope, ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000]
     • OLTP/NoSQL Workloads
     • ROLAP/Analytic/Reporting Workloads
     General purpose DBMS missed the target (dated database technology generally not optimal)
  21. What is your typical query?
     [Axis: Query Vision/Scope, ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000, spanning OLTP/NoSQL Workloads and Analytic Workloads]
     • There is no “average” query.
     • The challenges are at the extremes:
       o The challenge of high concurrency levels with small queries.
       o The challenge of latency for very large queries.
     • Most use cases imply multiple data technologies.
  22. Columnar Appropriate Workloads
     [Axis: Query Vision/Scope, ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000]
     • OLTP/NoSQL Workloads: Pure Columnar about 10x worse I/O for single record lookups
     • ROLAP/Analytic/Reporting Workloads: Pure Columnar about 10x better I/O for large data access patterns
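
A back-of-the-envelope model shows where a factor of about 10 in each direction can come from. The numbers below are assumptions chosen for illustration, not measurements from the talk: a table with 10 equally sized columns, a fixed block capacity, and no compression or caching.

```python
import math

COLUMNS = 10              # assumed table width
VALUES_PER_BLOCK = 8_000  # assumed block capacity, in column values


def row_store_blocks(rows, columns=COLUMNS):
    """Row store: every block holds complete rows, so all columns are read."""
    return math.ceil(rows * columns / VALUES_PER_BLOCK)


def column_store_blocks(rows, columns_needed):
    """Column store: one run of blocks per column actually referenced."""
    return columns_needed * math.ceil(rows / VALUES_PER_BLOCK)


# Single-record lookup that returns every column of one row.
lookup_ratio = column_store_blocks(1, COLUMNS) / row_store_blocks(1)

# Analytic scan over 80 million rows that references one column.
scan_ratio = row_store_blocks(80_000_000) / column_store_blocks(80_000_000, 1)
```

Under these assumptions the lookup touches 10 blocks in the column store against 1 in the row store, while the scan touches 10,000 blocks in the column store against 100,000 in the row store. Wider tables or narrower projections push both ratios further apart.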
  23. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
     [Diagram labels: Unstructured Data; Structured; Schema on read; Schema on write; Small Queries; Large Queries; Transform (ETL); Targeted Extract; Pre-defined queries; Ad-hoc queries]
  24. InfiniDB Query Performance: Percona Star Schema Benchmark (SSB)
     [Chart labels: Q1 Series, 2 table Joins; Q2 Series, 3 table Joins; Q3 Series, 4 table Joins; Q5 Series, 5 table Joins]
  25. 1000 Genomes Data Set: 289 Billion Rows
     • Fast load rate
       o Millions rows/sec
       o Billions rows/hour
     • Scalable load rate
     [Chart: 1000 Genomes data set on AWS]
  26. 1000 Genomes Data Set: ~24 trillion base nucleotide values
     Scaling: 4 -> 8 -> 16 Performance Modules
     • Fast Analytics
       o Millions of rows/second
     • Scalable Analytics
       o Automatic parallelism
     [Chart axes: Seconds per core; Performance Modules (PMs) Active]
     Figure 2 - TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein
  27. Impala-InfiniDB Benchmark (Piwik Data Set)
     Figure 1 - Piwik Standard Query Performance
     Figure 2 - Piwik Ad-Hoc Query Performance
     • Piwik is an Open Source alternative to Google Analytics
     • Queries 1-6 offered are Piwik production queries
     • Queries 7-9 are additional ad-hoc queries covering all data
     • Amazon 5-node cluster
  28. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
     [Diagram labels: Structured; Schema on read; Schema on write; InfiniDB; Small Queries; Large Queries; Transform (ETL); Targeted Extract; Ad-hoc queries]
  29. Download Today
     • InfiniDB and InfiniDB for Hadoop: www.calpont.com
     • InfiniDB for the Cloud: InfiniDB AMI in any AWS Availability Zone/Region
     • Services Inquiries: sales@calpont.com
     • Twitter: @InfiniDB @jtommaney
     © 2013 Calpont Corporation. Calpont, the Calpont logo, InfiniDB, and the InfiniDB logo are trademarks of Calpont Corporation. AWS is a trademark of Amazon.com, Inc., and Apache Hadoop is a trademark of the Apache Software Foundation. Other product names and logos may be trademarks of their respective owners.