December 2013 HUG: InfiniDB for Hadoop

December 2013 HUG: InfiniDB for Hadoop Presentation Transcript

  • 1. Bay Area Hadoop Users Group: Turning the Tables with InfiniDB for Hadoop (December 18, 2013)
  • 2. Agenda
      • InfiniDB Background
      • InfiniDB Technical Foundations
      • Parallelism
      • Partitioning Model
      • Additional I/O Efficiencies
      • (My)SQL for Hadoop
      • When to use Columnar/InfiniDB for Hadoop
      • InfiniDB Benchmarks
    Copyright © 2013 Calpont. All Rights Reserved.
  • 3. InfiniDB Background
      Platforms:
      • InfiniDB
      • InfiniDB for the Cloud
      • InfiniDB for Hadoop
      Versions:
      • InfiniDB launched Feb 2010
      • InfiniDB 4: latest release, available October 2013
      • Added InfiniDB for Hadoop
      Licensing:
      • Source code at https://github.com/infinidb
      • GPL v2
      • No restrictions on syntax, scale, or performance
  • 4. InfiniDB Background - Customer Base
  • 5. InfiniDB Background: Platforms
      • InfiniDB: Local Disk, GlusterFS, Windows*
          o http://www.calpont.com/products/tryinfinidb
      • InfiniDB for Hadoop: CDH or HDP
          o http://www.calpont.com/products/tryinfinidb
      • InfiniDB for the Cloud: Any availability zone
  • 6. InfiniDB Background – InfiniDB for Hadoop
      • InfiniDB is a non-map/reduce engine
      • Reads and writes natively to HDFS
      [Diagram: Pig/Hive, HBase, Map Reduce, and InfiniDB for Hadoop shown above the Hadoop Distributed File System]
  • 7. InfiniDB Background - InfiniDB for Hadoop
      • Is InfiniDB a Database? … not a General Purpose DBMS.
      • Is InfiniDB NoSQL? … only in the sense that we discarded traditional DBMS architectures.
      • Is InfiniDB an SQL for Hadoop technology? … Yes, but not general purpose SQL. InfiniDB is highly optimized for analytic workloads/queries.
      Customer quote: “InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn’t do with Hadoop, Mongo, or Teradata”
  • 8. InfiniDB Foundation - Parallelism
      • User Module – Processes SQL Requests
      • Performance Module – Executes the Queries
      [Diagram labels: Single Server, MPP; Local disk / EBS or GlusterFS / HDFS]
  • 9. InfiniDB Foundation - Parallelism
      • Purpose-built C++ engine
      • Parallelism is at the thread level
      • Example: 12 Performance Module (PM) servers with 8 cores each yield 96 parallel processing engines.
      • SQL is translated into thousands or tens of thousands of discrete jobs, or “primitives”.
      • The User Module (UM) sends primitives to the processing engines.
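
The arithmetic on slide 9 and the idea of cutting one request into many small primitives can be sketched as follows. This is only an illustration in Python, not InfiniDB's C++ engine: the function names, the row-range notion of a primitive, and the chunk size of 8192 rows are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def processing_engines(pm_servers, cores_per_pm):
    """Thread-level parallelism: one processing engine per core per PM."""
    return pm_servers * cores_per_pm


def split_into_primitives(total_rows, rows_per_primitive):
    """Cut a scan into (start, end) row ranges; each range is one hypothetical primitive."""
    return [
        (start, min(start + rows_per_primitive, total_rows))
        for start in range(0, total_rows, rows_per_primitive)
    ]


def run_primitive(row_range):
    """Stand-in for real work: report how many rows this primitive covered."""
    start, end = row_range
    return end - start


def run_query(total_rows, rows_per_primitive, workers):
    """Send every primitive to a fixed-size pool and combine the results."""
    primitives = split_into_primitives(total_rows, rows_per_primitive)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows_seen = sum(pool.map(run_primitive, primitives))
    return len(primitives), rows_seen


engines = processing_engines(pm_servers=12, cores_per_pm=8)  # the slide's 12 x 8 example
primitive_count, rows_seen = run_query(1_000_000, 8192, workers=8)
```

The pool size is fixed up front, which mirrors the "fixed thread count" wording used in the deck; how InfiniDB actually sizes and schedules its primitives is not described in the slides.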
  • 10. InfiniDB Foundation - Parallelism
      • User Module – Processes SQL Requests
      • Performance Module – Executes the Queries
          o Primitives are issued to thread queue within PM
          o Fixed thread count at PM
      [Diagram labels: Single Server, MPP; Local disk / EBS or GlusterFS / HDFS]
  • 11. Fully Parallel SQL + Full SQL Syntax
      [Diagram labels: DoW, Reduce]
      SQL Operations are translated into thousands of jobs via custom Distribution of Work (DoW):
      • Parallel/Distributed Data Access
      • Parallel/Distributed Joins (Inner, Outer)
      • Parallel/Distributed Sub-queries (From, Where, Select)
      • Parallel/Distributed Group By, Distinct, and Aggregation
      • Extensible with Parallel/Distributed User Defined Functions
      Results are returned to User Module in Reduce Phase
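
The distribute-then-reduce pattern on slide 11 can be illustrated with a parallel GROUP BY ... SUM. The sketch below is a generic Python rendering of that pattern under stated assumptions: the chunking of rows, the function names, and the use of a thread pool are hypothetical and are not taken from InfiniDB.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_group_sum(rows):
    """Work done on one chunk of (key, value) rows: a local GROUP BY key, SUM(value)."""
    acc = {}
    for key, value in rows:
        acc[key] = acc.get(key, 0) + value
    return acc


def reduce_partials(partials):
    """Reduce phase: merge the per-chunk partial sums into one result."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


def parallel_group_sum(chunks, workers=4):
    """Distribute chunks to workers, then reduce their partial aggregates."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_group_sum, chunks))
    return reduce_partials(partials)


chunks = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
result = parallel_group_sum(chunks)
```

SUM works here because partial sums can be added together; an aggregate such as AVG would need each worker to return a sum and a count so the reduce step can combine them correctly.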
  • 12. InfiniDB Data Partitioning: 2-Dimensional Partitioning Model
      • Vertical Partitioning by Column
          o Not Column-Family (no relation to HBase)
          o Only do I/O for columns requested
      • Horizontal Partitioning by range of rows
          o Meta-data stored within in-memory structure
  • 13. InfiniDB Data Partitioning
      • Partition elimination can occur based on:
          o Columns not included in the SQL.
          o A filter expressed within the query.
          o A filter expressed on a joined table: a Table1 filter can drive Table2 I/O elimination.
          o The intersection between filters: with Filter1 and Filter2, I/O is done only on the intersection.
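
One way to picture filter-driven partition elimination is with per-column range metadata kept in memory for each horizontal partition. The sketch below assumes each partition records a minimum and maximum value per column; that layout, the `ExtentMeta` type, and the sample values are assumptions for illustration, since the slides only say that metadata lives in an in-memory structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtentMeta:
    """Hypothetical metadata for one column within one range of rows."""
    row_range: int
    min_value: int
    max_value: int


def surviving_ranges(extents, lo, hi):
    """Row ranges whose [min, max] overlaps the filter [lo, hi]; the rest need no I/O."""
    return {e.row_range for e in extents if e.min_value <= hi and e.max_value >= lo}


# Made-up metadata for two columns over the same three row ranges.
date_col = [
    ExtentMeta(0, 20130101, 20130331),
    ExtentMeta(1, 20130401, 20130630),
    ExtentMeta(2, 20130701, 20130930),
]
price_col = [
    ExtentMeta(0, 5, 40),
    ExtentMeta(1, 10, 500),
    ExtentMeta(2, 300, 900),
]

by_date = surviving_ranges(date_col, 20130401, 20130930)  # Filter1 on the date column
by_price = surviving_ranges(price_col, 0, 50)             # Filter2 on the price column
to_read = by_date & by_price                              # I/O only on the intersection
```

In this made-up data each filter alone leaves two of the three row ranges, while the two filters together leave only one to read.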
  • 14. Column Restriction and Projection
      [Diagram: Columns Four, Six, and Seventeen; Filters 1, 2, and 3; Extents 5 and 27; Projection]
      • Automatic Vertical Partitioning + Horizontal Partitioning
      • Just-In-Time Materialization
  • 15. Additional I/O Efficiency
      Techniques to Avoid Unnecessary I/O:
      • Vertical Partitioning: read only the columns required
      • Horizontal Partitioning: focus on the rows required
      • Just-in-time materialization
      Techniques for Efficient I/O:
      • Columnar compression reduces I/O from disk
      • Global data buffer cache can reduce disk I/O (in-memory)
      • Avoidance of Random I/O
  • 16. InfiniDB Design Principles: Scalable, Fast, Simple
  • 17. (My)SQL for Hadoop - Engine=InfiniDB
      InfiniDB uses standard “Engine=InfiniDB” syntax:
          CREATE TABLE `game_warehouse`.`dim_title` (
            `id` INT,
            `name` VARCHAR(45),
            `publisher` VARCHAR(45),
            `release_date` DATE,
            `language` INT,
            `platform_name` VARCHAR(45),
            `version` VARCHAR(45)
          ) ENGINE=InfiniDB;
  • 18. (My)SQL for Hadoop
      • Leverage existing tools that connect to MySQL
      • Expose Structured Data to the Business
      • Familiar User Privilege Administration
      [Tool logos: MicroStrategy, JasperSoft, Pentaho]
      MySQL ease of use + Hadoop Scale + Columnar Performance
  • 19. Syntax Support
      • Broad MySQL SQL syntax
      • Analytic/windowing functions included with InfiniDB 4
      • No indexing needed. Partitioning is automatic.
      [Figure: InfiniDB Supported Syntax]
  • 20. When to Use InfiniDB for Hadoop
      Query Size (Vision/Scope) defines workloads.
      [Chart: Query Size/Vision/Scope axis with ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000. Regions labeled OLTP/NoSQL Workloads and ROLAP/Analytic/Reporting Workloads.]
      • General purpose DBMS missed the target (dated database technology generally not optimal)
  • 21. What is your typical query?
      [Chart: Query Vision/Scope axis with ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000. Regions labeled OLTP/NoSQL Workloads and Analytic Workloads.]
      • There is no “average” query.
      • The challenges are at the extremes:
          o The challenge of high concurrency levels with small queries.
          o The challenge of latency for very large queries.
      • Most use cases imply multiple data technologies.
  • 22. Columnar Appropriate Workloads
      [Chart: Query Vision/Scope axis with ticks at 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000]
      • OLTP/NoSQL Workloads: Pure Columnar about 10x worse I/O for single record lookups
      • ROLAP/Analytic/Reporting Workloads: Pure Columnar about 10x better I/O for large data access patterns
  • 23. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
      [Diagram labels: Unstructured Data, Structured; Schema on read, Schema on write; Small Queries, Large Queries; Transform (ETL), Targeted Extract; Pre-defined queries, Ad-hoc queries]
  • 24. InfiniDB Query Performance – Percona Star Schema Benchmark (SSB)
      [Chart series: Q1 Series, 2 table Joins; Q2 Series, 3 table Joins; Q3 Series, 4 table Joins; Q5 Series, 5 table Joins]
  • 25. 1000 Genomes Data Set – 289 Billion Rows
      • Fast load rate
          o Millions of rows/sec
          o Billions of rows/hour
      • Scalable load rate
      [Figure: 1000 Genomes data set on AWS]
  • 26. 1000 Genomes Data Set – ~24 trillion base nucleotide values
      Scaling: 4 –> 8 –> 16 Performance Modules
      • Fast Analytics
          o Millions of rows/second
      • Scalable Analytics
      • Automatic parallelism
      [Chart axes: Seconds per core; Performance Modules (PMs) Active]
      Figure 2 - TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein
  • 27. Impala-InfiniDB Benchmark (Piwik Data Set)
      Figure 1 - Piwik Standard Query Performance
      Figure 2 - Piwik Ad-Hoc Query Performance
      • Piwik is an Open Source alternative to Google Analytics
      • Queries 1-6 are Piwik production queries
      • Queries 7-9 are additional ad-hoc queries covering all data
      • Amazon 5-node cluster
  • 28. Columnar Appropriate Workloads: Data Dimensions and InfiniDB for Hadoop
      [Diagram labels: InfiniDB; Structured; Schema on read, Schema on write; Small Queries, Large Queries; Transform (ETL), Targeted Extract; Ad-hoc queries]
  • 29. Download Today
      • InfiniDB and InfiniDB for Hadoop: www.calpont.com
      • InfiniDB for the Cloud: InfiniDB AMI in any AWS Availability Zone/Region
      • Services Inquiries: sales@calpont.com
      • Twitter: @InfiniDB @jtommaney
    © 2013 Calpont Corporation. Calpont, the Calpont logo, InfiniDB, and the InfiniDB logo are trademarks of Calpont Corporation. AWS is a trademark of Amazon.com, Inc., and Apache Hadoop is a trademark of the Apache Software Foundation. Other product names and logos may be trademarks of their respective owners.