December 2013 HUG: InfiniDB for Hadoop

Published in: Technology
December 2013 HUG: InfiniDB for Hadoop

  1. Bay Area Hadoop Users Group: Turning the Tables with InfiniDB for Hadoop (December 18, 2013)
  2. Agenda
     • InfiniDB Background
     • InfiniDB Technical Foundations
     • Parallelism
     • Partitioning Model
     • Additional I/O Efficiencies
     • (My)SQL for Hadoop
     • When to use Columnar/InfiniDB for Hadoop
     • InfiniDB Benchmarks
     Copyright © 2013 Calpont. All Rights Reserved.
  3. InfiniDB Background
     Platforms:
     • InfiniDB
     • InfiniDB for the Cloud
     • InfiniDB for Hadoop
     Versions:
     • InfiniDB launched February 2010
     • InfiniDB 4: latest release, available October 2013
     • Added InfiniDB for Hadoop
     Source and license:
     • Source code at https://github.com/infinidb
     • GPL v2
     • No restrictions on syntax, scale, or performance
  4. InfiniDB Background - Customer Base
  5. InfiniDB Background: Platforms
     • InfiniDB: Local Disk, GlusterFS, Windows*
       http://www.calpont.com/products/tryinfinidb
     • InfiniDB for Hadoop: CDH or HDP
       http://www.calpont.com/products/tryinfinidb
     • InfiniDB for the Cloud: any availability zone
  6. InfiniDB Background – InfiniDB for Hadoop
     • InfiniDB is a non-map/reduce engine
     • Reads and writes natively to HDFS
     [Diagram labels: Pig/Hive, HBase, Map Reduce, InfiniDB for Hadoop; Hadoop Distributed File System]
  7. InfiniDB Background - InfiniDB for Hadoop
     • Is InfiniDB a Database? … not a General Purpose DBMS.
     • Is InfiniDB NoSQL? … only in the sense that we discarded traditional DBMS architectures.
     • Is InfiniDB an SQL for Hadoop technology? … Yes, but not general purpose SQL. InfiniDB is highly optimized for analytic workloads/queries.
     Customer quote: “InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn’t do with Hadoop, Mongo, or Teradata”
  8. InfiniDB Foundation - Parallelism
     • User Module – Processes SQL Requests
     • Performance Module – Executes the Queries
     [Diagram labels: Single Server or MPP; Local disk / EBS; GlusterFS / HDFS]
  9. InfiniDB Foundation - Parallelism
     • Purpose-built C++ engine
     • Parallelism is at the thread level
     • Example: 12 PM servers with 8 cores each yield 96 parallel processing engines.
     • SQL is translated into thousands or tens of thousands of discrete jobs or “primitives”.
     • The UM sends primitives to the processing engines.
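The arithmetic and the primitive fan-out on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_into_primitives`, `run_primitive`, and the "divisible by 7" predicate are hypothetical names and logic invented for the example, and a Python thread pool merely stands in for InfiniDB's C++ processing engines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

PM_SERVERS = 12    # figure taken from the slide
CORES_PER_PM = 8   # figure taken from the slide


def processing_engines(pm_servers: int, cores_per_pm: int) -> int:
    """Thread-level parallelism: one processing engine per core."""
    return pm_servers * cores_per_pm


def split_into_primitives(total_rows: int, rows_per_primitive: int) -> List[range]:
    """Cut one scan into many small, independent row ranges (hypothetical)."""
    return [
        range(start, min(start + rows_per_primitive, total_rows))
        for start in range(0, total_rows, rows_per_primitive)
    ]


def run_primitive(rows: range) -> int:
    """Hypothetical primitive: count row ids divisible by 7 in one range."""
    return sum(1 for row_id in rows if row_id % 7 == 0)


def run_query(total_rows: int, rows_per_primitive: int, engines: int) -> int:
    """Play the User Module role: hand primitives to the engines, sum results."""
    primitives = split_into_primitives(total_rows, rows_per_primitive)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        return sum(pool.map(run_primitive, primitives))


if __name__ == "__main__":
    engines = processing_engines(PM_SERVERS, CORES_PER_PM)
    print(engines, "engines")
    print(len(split_into_primitives(1_000_000, 100)), "primitives")
    print(run_query(1_000_000, 100, engines), "matching rows")
```

The point of the sketch is the shape of the work: many more primitives than engines, so every engine stays busy until the scan is finished.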
  10. InfiniDB Foundation - Parallelism
      • User Module – Processes SQL Requests
      • Performance Module – Executes the Queries
      • Primitives are issued to a thread queue within the PM
      • Fixed thread count at the PM
      [Diagram labels: Single Server or MPP; Local disk / EBS; GlusterFS / HDFS]
  11. Fully Parallel SQL + Full SQL Syntax
      [Diagram labels: DoW, Reduce]
      SQL operations are translated into thousands of jobs via custom Distribution of Work (DoW):
      • Parallel/Distributed Data Access
      • Parallel/Distributed Joins (Inner, Outer)
      • Parallel/Distributed Sub-queries (From, Where, Select)
      • Parallel/Distributed Group By, Distinct, and Aggregation
      • Extensible with Parallel/Distributed User Defined Functions
      Results are returned to the User Module in the Reduce phase.
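As a rough illustration of a distributed Group By followed by a Reduce phase, the sketch below computes a SUM per key on each data slice in parallel and then merges the partial results. The function names and the slice layout are assumptions made for the example; the slides do not describe InfiniDB's actual job format.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group key, amount): a hypothetical two-column row


def partial_group_by(rows: Iterable[Row]) -> Dict[str, int]:
    """Distribution of Work step: SUM(amount) GROUP BY key for one slice."""
    totals: Dict[str, int] = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


def reduce_phase(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Reduce step: merge the per-slice sums into one final result."""
    merged: Dict[str, int] = {}
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] = merged.get(key, 0) + subtotal
    return merged


def distributed_group_by(slices: List[List[Row]], workers: int = 4) -> Dict[str, int]:
    """Run one partial aggregation per slice in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_group_by, slices))
    return reduce_phase(partials)


if __name__ == "__main__":
    slices = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("c", 5)]]
    print(distributed_group_by(slices))
```

SUM merges cleanly because it is associative; an aggregate such as AVG would need each slice to return a (sum, count) pair instead.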
  12. InfiniDB Data Partitioning
      2-Dimensional Partitioning Model
      • Vertical partitioning by column
        o Not Column-Family (no relation to HBase)
        o Only do I/O for columns requested
      • Horizontal partitioning by range of rows
        o Meta-data stored within in-memory structure
  13. InfiniDB Data Partitioning
      Partition elimination can occur based on:
      • Columns not included in the SQL.
      • A filter expressed within the query.
      • A filter expressed on a joined table: a Table1 filter can drive Table2 I/O elimination.
      • The intersection between filters: with Filter1 and Filter2, I/O is done only on the intersection.
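A minimal sketch of filter-based partition elimination follows. It assumes that the in-memory metadata holds a minimum and maximum value per extent (a row-range partition of one column) and that extents with the same id in different columns cover the same rows. The slides only say that metadata is kept in memory, so the `Extent` class and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Extent:
    """Hypothetical metadata for one row-range partition of one column."""
    extent_id: int
    min_value: int
    max_value: int


def extents_to_read(extents: List[Extent], lo: int, hi: int) -> List[int]:
    """Keep only extents whose [min, max] range overlaps the filter [lo, hi]."""
    return [
        e.extent_id
        for e in extents
        if e.max_value >= lo and e.min_value <= hi
    ]


def intersect(first: List[int], second: List[int]) -> List[int]:
    """Filter1 AND Filter2: only extents that survive both need I/O."""
    return sorted(set(first) & set(second))


if __name__ == "__main__":
    order_day = [Extent(0, 1, 100), Extent(1, 101, 200), Extent(2, 201, 300)]
    store_id = [Extent(0, 5, 9), Extent(1, 1, 3), Extent(2, 4, 8)]

    by_day = extents_to_read(order_day, 150, 250)   # filter on order_day
    by_store = extents_to_read(store_id, 4, 6)      # filter on store_id
    print(by_day, by_store, intersect(by_day, by_store))
```

Only the metadata is consulted here; no column data is read until the surviving extent list is known.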
  14. Column Restriction and Projection
      [Figure labels: Column # Four, Column # Six, Column # Seventeen; Extent # 5, Extent # 27; Filter 1, Filter 2, Filter 3; Projection]
      • Automatic Vertical Partitioning + Horizontal Partitioning
      • Just-In-Time Materialization
  15. Additional I/O Efficiency
      Techniques to Avoid Unnecessary I/O
      • Vertical partitioning: read only the columns required
      • Horizontal partitioning: focus on the rows required
      • Just-in-time materialization
      Techniques for Efficient I/O
      • Columnar compression reduces I/O from disk
      • Global data buffer cache can reduce disk I/O (in-memory)
      • Avoidance of random I/O
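The first three techniques can be illustrated with a toy column store in Python. This is a sketch under stated assumptions: the table is a plain dictionary of column lists, and `run_query` is a hypothetical helper that filters on one column, then builds output rows only for the surviving positions and only from the projected columns.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

ColumnStore = Dict[str, List[Any]]  # column name -> values, one list per column


def run_query(
    columns: ColumnStore,
    filter_column: str,
    predicate: Callable[[Any], bool],
    projection: List[str],
) -> Tuple[List[tuple], List[str]]:
    """Filter on one column, then materialize rows just in time.

    Returns the result rows and the sorted names of the columns touched.
    """
    touched: Set[str] = {filter_column}

    # Restriction: scan only the filter column and remember row positions.
    positions = [
        i for i, value in enumerate(columns[filter_column]) if predicate(value)
    ]

    # Projection: read the requested columns for surviving positions only.
    touched.update(projection)
    rows = [tuple(columns[name][i] for name in projection) for i in positions]
    return rows, sorted(touched)


if __name__ == "__main__":
    table: ColumnStore = {
        "id": [1, 2, 3, 4],
        "name": ["a", "b", "c", "d"],
        "price": [10, 25, 30, 5],
        "notes": ["w", "x", "y", "z"],
    }
    rows, touched = run_query(table, "price", lambda p: p >= 25, ["name"])
    print(rows, touched)
```

In this toy run the `id` and `notes` columns are never read, which is the vertical-partitioning saving; full rows are never assembled for positions the filter rejected.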
  16. InfiniDB Design Principles: Scalable, Fast, Simple
  17. (My)SQL for Hadoop - Engine=InfiniDB
      InfiniDB uses standard “Engine=InfiniDB” syntax:

          CREATE TABLE `game_warehouse`.`dim_title` (
            `id` INT,
            `name` VARCHAR(45),
            `publisher` VARCHAR(45),
            `release_date` DATE,
            `language` INT,
            `platform_name` VARCHAR(45),
            `version` VARCHAR(45)
          ) ENGINE=InfiniDB;
  18. (My)SQL for Hadoop
      • Leverage existing tools that connect to MySQL (MicroStrategy, JasperSoft, Pentaho)
      • Expose structured data to the business
      • Familiar user privilege administration
      MySQL ease of use + Hadoop scale + columnar performance
  19. Syntax Support
      • Broad MySQL SQL syntax
      • Analytic/windowing functions included with InfiniDB 4
      • No indexing needed. Partitioning is automatic.
      [Figure: InfiniDB Supported Syntax]
  20. When to Use InfiniDB for Hadoop
      Query size (vision/scope) defines workloads.
      [Chart: axis “Query Size/Vision/Scope” from 1 to 10,000,000,000; regions labeled OLTP/NoSQL Workloads and ROLAP/Analytic/Reporting Workloads]
      • General-purpose DBMS missed the target (dated database technology is generally not optimal)
  21. What is your typical query?
      [Chart: axis “Query Vision/Scope” from 1 to 10,000,000,000; regions labeled OLTP/NoSQL Workloads and Analytic Workloads]
      • There is no “average” query.
      • The challenges are at the extremes:
        o The challenge of high concurrency levels with small queries.
        o The challenge of latency for very large queries.
      • Most use cases imply multiple data technologies.
  22. Columnar Appropriate Workloads
      [Chart: axis “Query Vision/Scope” from 1 to 10,000,000,000]
      • OLTP/NoSQL Workloads: pure columnar has about 10x worse I/O for single record lookups
      • ROLAP/Analytic/Reporting Workloads: pure columnar has about 10x better I/O for large data access patterns
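One simple way to see where a figure of roughly 10x could come from is a back-of-the-envelope I/O model. The model below is an assumption, not something stated on the slide: it supposes a table of 10 equally sized columns, a lookup that returns the whole record, and a scan that needs 1 of the 10 columns. All function names are hypothetical.

```python
def point_lookup_ios(total_columns: int, columnar: bool) -> int:
    """I/O operations needed to fetch one complete record.

    A row store finds the whole record in one place; a pure column
    store has to visit one location per column.
    """
    return total_columns if columnar else 1


def scan_values_read(
    rows: int, total_columns: int, columns_needed: int, columnar: bool
) -> int:
    """Values read from storage when scanning every row.

    A row store drags every column along; a column store reads only
    the columns the query asks for.
    """
    return rows * (columns_needed if columnar else total_columns)


if __name__ == "__main__":
    total_columns = 10   # assumed table width
    columns_needed = 1   # assumed number of columns the scan touches
    rows = 1_000_000     # assumed table size

    lookup_penalty = point_lookup_ios(total_columns, True) / point_lookup_ios(
        total_columns, False
    )
    scan_gain = scan_values_read(
        rows, total_columns, columns_needed, False
    ) / scan_values_read(rows, total_columns, columns_needed, True)

    print("columnar lookup penalty:", lookup_penalty)
    print("columnar scan gain:", scan_gain)
```

Under these assumptions both ratios equal the table width divided by the number of columns used, so wider tables and narrower queries push the two extremes further apart. Compression and caching, which the deck lists separately, are ignored here.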
  23. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
      [Diagram labels: Unstructured Data / Structured; Schema on read / Schema on write; Small Queries / Large Queries; Transform (ETL) / Targeted Extract; Pre-defined queries / Ad-hoc queries]
  24. InfiniDB Query Performance – Percona Star Schema Benchmark (SSB)
      [Chart labels: Q1 Series, 2 table joins; Q2 Series, 3 table joins; Q3 Series, 4 table joins; Q5 Series, 5 table joins]
  25. 1000 Genomes Data Set – 289 Billion Rows
      • Fast load rate
        o Millions of rows/sec
        o Billions of rows/hour
      • Scalable load rate
      [Figure: 1000 Genomes data set on AWS]
  26. 1000 Genomes Data Set – ~24 trillion base nucleotide values
      Scaling: 4 -> 8 -> 16 Performance Modules
      • Fast analytics
        o Millions of rows/second
      • Scalable analytics
      • Automatic parallelism
      [Chart: seconds per core vs. Performance Modules (PMs) active]
      [Figure 2 - TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein]
  27. Impala-InfiniDB Benchmark (Piwik Data Set)
      [Figure 1 - Piwik Standard Query Performance]
      [Figure 2 - Piwik Ad-Hoc Query Performance]
      • Piwik is an open source alternative to Google Analytics
      • Queries 1-6 are Piwik production queries
      • Queries 7-9 are additional ad-hoc queries covering all data
      • Amazon 5-node cluster
  28. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
      [Diagram labels: Structured; Schema on read / Schema on write; Small Queries / Large Queries; Transform (ETL) / Targeted Extract; Ad-hoc queries; InfiniDB]
  29. Download Today
      • InfiniDB and InfiniDB for Hadoop: www.calpont.com
      • InfiniDB for the Cloud: InfiniDB AMI in any AWS Availability Zone/Region
      • Services inquiries: sales@calpont.com
      • Twitter: @InfiniDB @jtommaney
      © 2013 Calpont Corporation. Calpont, the Calpont logo, InfiniDB, and the InfiniDB logo are trademarks of Calpont Corporation. AWS is a trademark of Amazon.com, Inc., and Apache Hadoop is a trademark of the Apache Software Foundation. Other product names and logos may be trademarks of their respective owners.
