Improving the performance of ad-hoc
analysis of large datasets
     True North R&D: Evaluation of Infobright Community Edition
Situation
Most organisations will have at least one data warehouse or data mart containing business data
specific to a department. These databases typically feed management information systems (MIS) and/or
business intelligence (BI) solutions and, in larger organisations, are usually relational data stores
optimised to perform particular tasks¹.

Often business users want to perform additional analysis on the data in the warehouse or mart in
order to gain insights into customer or employee behaviour. Examples of this might be “Who are my
top 10 customers buying widgets in the following regions over the past six months?”; “Which
employees over director grade and in the IT department spend the most on employee benefits?”;
“Which customers using the Safari browser who click on the Swedish landing page go on to spend
over 100 kronor?”.


The problem
This desire to perform ad-hoc analysis or data mining can lead to difficulties for the teams that own
and provide access to the data.

This is because data marts are usually optimised for a particular set of use cases and hence are
aggregated and indexed on the dimensions that match those use cases. So a Sales data mart may be
built to query on the dimensions of product code, region and sales manager, but may not be geared up
to answer queries about the marketing campaign code of the product. The data warehouse itself (if a
traditional warehouse) will not make any optimisations along dimensions.
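
The effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for the mart's real database, and the table, column and index names (sales, campaign_code, idx_sales_region) are hypothetical. It asks the query planner how it would answer a query on an indexed dimension and one on a dimension nobody planned for.

```python
import sqlite3

# Hypothetical sales mart; SQLite is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "product_code TEXT, region TEXT, sales_manager TEXT, "
    "campaign_code TEXT, quantity INTEGER)"
)
# The mart is indexed only on a dimension its planned reports use.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")


def query_plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# Anticipated query: filters on an indexed dimension.
planned = query_plan("SELECT SUM(quantity) FROM sales WHERE region = 'North'")
# Ad-hoc query: filters on a dimension that was never indexed.
ad_hoc = query_plan("SELECT SUM(quantity) FROM sales WHERE campaign_code = 'C42'")

print("planned:", planned)
print("ad hoc :", ad_hoc)
```

The planned query is answered through the index, while the ad-hoc query falls back to scanning the whole table, which is where the long response times come from.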

For this reason, users are often discouraged or prevented from performing this type of analysis on
data warehouses. If they are allowed access there are two opposing factors:

  • Long response times to ad-hoc queries lead to a poor user experience

  • Database optimisations (indexes and aggregate tables) greatly increase the amount of
    storage required²


Reason for this evaluation
Several of our current clients would benefit from being able to mine their data marts in an efficient
and productive (from a user experience perspective) manner.

This document looks at a potential solution to part of that problem: enabling efficient access to
the data, from the point of view of both storage and response times.

It describes an evaluation of Infobright Community Edition (ICE) as a means of enabling ad-hoc
analysis of metric data.




¹ Smaller organisations often have their data warehouse made up of one or more spreadsheets.
² This has the knock-on effect of increasing the time and complexity of populating the database.

Scope
This document is not a full evaluation of Infobright, nor is it an endorsement of the product. Rather it
describes the reasons for, approach, and results of an evaluation of Infobright Community Edition
with a limited number of real-life data queries.


About Infobright
Infobright is a database designed to solve analytical queries. It is built on MySQL but uses a different
storage engine, Brighthouse, rather than one of the standard storage engines (e.g. MyISAM,
InnoDB).

Infobright does not use indexes or aggregate tables. Instead it relies on being a column-oriented
(columnar) database, which makes it better suited to aggregate analytics.
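
As a rough illustration of why a columnar layout suits aggregates, the sketch below stores the same three made-up records both ways using plain Python structures. It is a toy model only and says nothing about how Brighthouse actually stores or compresses data; all names and values are invented for the example.

```python
# Row-oriented layout: each record is stored together.
rows = [
    ("acme", "north", 5, 120.0),
    ("globex", "south", 20, 900.0),
    ("initech", "north", 3, 45.5),
]

# Column-oriented layout: each column is stored together.
columns = {
    "customer": ["acme", "globex", "initech"],
    "region": ["north", "south", "north"],
    "quantity": [5, 20, 3],
    "revenue": [120.0, 900.0, 45.5],
}

# Row store: every whole record is visited to read one field from it.
total_from_rows = sum(record[2] for record in rows)

# Column store: only the "quantity" column is read.
total_from_columns = sum(columns["quantity"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same total, but the columnar version reads only the one column the aggregate needs, however many other columns the table has.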

This is for the most part invisible to the user (depending on which edition is used) and Infobright can
be accessed through the same clients used for a regular MySQL instance.

Infobright comes in two flavours. The Community Edition (ICE) is Open Source Software and the
Enterprise Edition (IEE) is a commercial product. The chief differences between the two offerings are
support for data loading and DML (i.e. INSERT, UPDATE, DELETE).


Evaluation
We performed a limited evaluation to determine whether ICE would provide benefits in a real-life
situation.

We used data from a warehouse belonging to one of our clients and worked with them to
understand the analysis that they would like to perform but have so far been unable to.
The data and problem domain have been made anonymous and generic within this report to protect
client confidentiality.

The key principles for the evaluation were:

  • Use real data volumes

  • Ask real questions of the data


Aim
The aim of the evaluation was to understand how an Infobright Community Edition (ICE)
database compared to a standard MySQL database (using the InnoDB storage engine) on the
following dimensions:

  • User response times to sample queries

  • Storage space required by the database


Specifications
Tests were performed on a developer's desktop machine:

  • Pentium Dual-core 2.16 GHz, 3 GB RAM, Windows XP Professional

  • MySQL Community Edition 5.1
      o Using InnoDB

  • Infobright Community Edition 3.3.1

  • HeidiSQL was used to run the queries

  • Approximately a year's worth of historical data was loaded into the databases. This equated
    to 1.3 million rows.


Approach
In both cases the databases were loaded with approximately a year's worth of data, which equated to
1,291,062 rows.

The time taken to load the databases was not compared, as ICE only allows loading from a flat file³.
For reference, it took 1 minute 29 seconds (1'29") to load the data into ICE.


Test 1: Comparing storage requirements
In this case, the same data was loaded into both databases but the InnoDB database had no
optimisations applied (i.e. no keys, indexes, aggregates, etc.). This was so that the space
measured reflected only the data itself.


Test 2: Comparing response times
The second test was to compare the performance of an ICE database against that of an optimised
InnoDB database. The database could not be optimised for all queries against it (as they are ad-hoc)
but was optimised for only the selected queries.
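
The report does not say which optimisations were applied, so the following is only a guess at what tuning for one selected query might look like. It uses SQLite in place of MySQL so that it runs anywhere, and the sales table, its columns, its rows and the index name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified fact table with invented rows.
conn.execute("CREATE TABLE sales (customer TEXT, quantity INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("acme", 5, 120.0),
        ("acme", 7, 150.0),
        ("globex", 20, 900.0),
        ("initech", 3, 45.5),
    ],
)

# An index chosen for this one query (an assumption, not the author's setup).
conn.execute(
    "CREATE INDEX idx_sales_customer_quantity ON sales (customer, quantity)"
)

# One plausible form of the "top 10 customers by quantity" query.
top_customers = conn.execute(
    "SELECT customer, SUM(quantity) AS total_quantity "
    "FROM sales GROUP BY customer "
    "ORDER BY total_quantity DESC LIMIT 10"
).fetchall()

print(top_customers)
```

An index like this helps only the queries it was designed for; a new ad-hoc question on another column would need its own index or a full scan.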




³ IEE allows population through more means (e.g. using DML, or binary dumps rather than ASCII).
See more at http://bit.ly/aXQvKM
Results

Test 1: ICE compared to a non-optimised InnoDB database
Storage space
Infobright needed 17.7 MB to store the 1,291,062 rows versus 203.8 MB needed by InnoDB.
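
To put the two figures side by side, the short calculation below derives the size ratio and the percentage saving from the numbers reported above. The variable names are chosen for this sketch only.

```python
# Storage figures as reported for the 1,291,062-row data set.
infobright_size = 17.7
innodb_size = 203.8

# How many times larger the InnoDB copy is.
size_ratio = innodb_size / infobright_size

# Fraction of the InnoDB footprint saved by using Infobright.
saving = 1 - infobright_size / innodb_size

print(f"InnoDB is {size_ratio:.1f} times larger")
print(f"Infobright saves {saving:.1%} of the space")
```

On these figures the InnoDB copy is about 11.5 times larger, a saving of roughly 91%.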




Response times
Query                                                  Infobright   InnoDB    Times faster

top 10 customers by quantity                           3.828        147.781   39

top 10 customers by revenue                            7.734        124.703   16

top 10 customers with revenue between 300K and 600K    8.109        160.094   20

top 10 customers by quantity between Jan and Apr       1.235        21.703    18
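
The last column can be reproduced from the two timing columns by dividing the InnoDB time by the Infobright time and rounding to the nearest whole number. The sketch below does that with the figures from the table; the variable names are illustrative.

```python
# (query, Infobright time, InnoDB time) as reported in the table above.
timings = [
    ("top 10 customers by quantity", 3.828, 147.781),
    ("top 10 customers by revenue", 7.734, 124.703),
    ("top 10 customers with revenue between 300K and 600K", 8.109, 160.094),
    ("top 10 customers by quantity between Jan and Apr", 1.235, 21.703),
]

# Speed-up factor: InnoDB time divided by Infobright time, rounded.
speedups = [round(innodb / infobright) for _, infobright, innodb in timings]

for (query, _, _), factor in zip(timings, speedups):
    print(f"{query}: {factor}x faster")
```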
Test 2: ICE compared to an optimised InnoDB database
Storage space

Response times


Conclusions

About the author
