Improving the performance of ad-hoc
analysis of large datasets
True North R&D: Evaluation of Infobright Community Edition
Most organisations will have at least one data warehouse, or data marts containing business data
specific to a department. These databases typically feed management information (MIS) and/or
business intelligence (BI) solutions and, in larger organisations, are usually relational data stores
optimised to perform particular tasks [1].
Often business users want to perform additional analysis on the data in the warehouse or mart in
order to gain insights into customer or employee behaviour. Examples of this might be “Who are my
top 10 customers buying widgets in the following regions over the past six months?”; “Which
employees over director grade and in the IT department spend the most on employee benefits?”;
“Which customers using the Safari browser who click on the Swedish landing page go on to spend
over 100 kronor?”
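As a rough illustration, the first of these questions could be expressed as a parameterised query. The sketch below uses Python's built-in sqlite3 module as a stand-in database; the `sales` table, its columns and every row in it are hypothetical and are not taken from the client data used in this evaluation.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, product TEXT, region TEXT, "
    "sale_date TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Acme", "widget", "North", "2010-03-01", 40),
        ("Acme", "widget", "South", "2010-04-12", 15),
        ("Bolt", "widget", "North", "2010-02-20", 70),
        ("Bolt", "gadget", "North", "2010-02-21", 500),
        ("Cog", "widget", "West", "2010-05-05", 90),
        ("Dyn", "widget", "North", "2009-01-10", 300),
    ],
)


def top_customers(conn, product, regions, start, end, limit=10):
    """Return the top customers by quantity for one product, a set of
    regions and an inclusive date range (ISO dates compare as text)."""
    placeholders = ", ".join("?" for _ in regions)
    sql = (
        "SELECT customer, SUM(quantity) AS total "
        "FROM sales "
        "WHERE product = ? "
        f"AND region IN ({placeholders}) "
        "AND sale_date BETWEEN ? AND ? "
        "GROUP BY customer "
        "ORDER BY total DESC, customer "
        "LIMIT ?"
    )
    return conn.execute(sql, [product, *regions, start, end, limit]).fetchall()


top = top_customers(conn, "widget", ["North", "South"], "2010-01-01", "2010-06-30")
print(top)
```

Each new question of this kind tends to filter and group on a different combination of columns, which is what makes ad-hoc analysis hard to plan for in advance.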
This desire to perform ad-hoc analysis or data mining can lead to difficulties for the teams that own
and provide access to the data.
This is because data marts are usually optimised for a particular set of use cases and hence are
aggregated and indexed on the dimensions that match those use cases. So a Sales data mart may be
built to query on the dimensions of product code, region and sales manager, but may not be geared
up to answer queries by the marketing campaign code of the product. The data warehouse itself (if a
traditional warehouse) will not make any optimisations along dimensions.
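The effect of indexing on some dimensions but not others can be seen in a query plan. The following is a minimal sketch using SQLite (through Python's sqlite3 module) as a stand-in for a row-oriented relational database. The table, column and index names are hypothetical, and SQLite's planner is not MySQL's, so this only illustrates the general principle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sales mart, indexed only on the dimension it was designed for.
conn.execute(
    "CREATE TABLE sales (product_code TEXT, region TEXT, sales_manager TEXT, "
    "campaign_code TEXT, revenue REAL)"
)
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")


def query_plan(conn, sql):
    """Return SQLite's plan for a query as one string.
    The last column of each EXPLAIN QUERY PLAN row is the readable detail."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# A query on an anticipated dimension can use the index.
planned = query_plan(conn, "SELECT SUM(revenue) FROM sales WHERE region = 'North'")
# An ad-hoc query on an unindexed dimension falls back to scanning the table.
ad_hoc = query_plan(conn, "SELECT SUM(revenue) FROM sales WHERE campaign_code = 'C42'")

print("planned:", planned)
print("ad hoc: ", ad_hoc)
```

The exact wording of the plan varies between SQLite versions, but the first query searches via `idx_sales_region` while the second scans the whole table.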
For this reason, users are often discouraged or prevented from performing this type of analysis on
data warehouses. If they are allowed access there are two opposing factors:
Long response times to ad-hoc queries lead to a poor user experience
Database optimisations (indexes and aggregate tables) greatly increase the amount of
Reason for this evaluation
Several of our current clients would benefit from being able to mine their data marts in an efficient
and productive (from a user experience perspective) manner.
This document looks at a potential solution to part of that problem: enabling efficient access to
the data, from the point of view of both storage and response times.
It is an evaluation of Infobright Community Edition (ICE) as a means of enabling ad-hoc analysis
of metric data.
[1] Smaller organisations often have their data warehouse made up of one or more spreadsheets.
[2] This has a knock-on effect of increasing the time required and complexity of populating the
This document is not a full evaluation of Infobright, nor is it an endorsement of the product. Rather it
describes the reasons for, approach, and results of an evaluation of Infobright Community Edition
with a limited number of real-life data queries.
Infobright is a database designed to solve analytical queries. It is built on MySQL but uses a different
storage engine, Brighthouse, rather than one of the standard storage engines (e.g. MyISAM,
Infobright does not use indexes or aggregate tables. Instead it relies on being a column-oriented
(columnar) database, which makes it better suited to aggregate analytics.
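Why a columnar layout suits aggregation can be shown with a deliberately simplified model. The sketch below is not a description of how Infobright's storage engine works internally; it only contrasts how many stored values an aggregate over one field has to touch when data is held row by row versus column by column. All names and figures in it are invented.

```python
# Row-oriented layout: each record is stored (and read) as a whole.
rows = [
    {"customer": "Acme", "region": "North", "quantity": 40, "revenue": 400.0},
    {"customer": "Bolt", "region": "South", "quantity": 70, "revenue": 910.0},
    {"customer": "Cog", "region": "West", "quantity": 90, "revenue": 630.0},
]

# Column-oriented layout: one list of values per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_row_store(rows, field):
    """Sum one field, counting every value read along the way.
    In this model a row store has to read each full row."""
    total = 0
    values_touched = 0
    for row in rows:
        values_touched += len(row)
        total += row[field]
    return total, values_touched


def total_column_store(columns, field):
    """Sum one field by reading only that field's column."""
    column = columns[field]
    return sum(column), len(column)


print(total_row_store(rows, "quantity"))        # total and values read, row layout
print(total_column_store(columns, "quantity"))  # same total, fewer values read
```

Both functions return the same total, but the columnar version reads one value per row rather than every field of every row. A real engine adds compression and other techniques on top of this basic idea.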
This is for the most part invisible to the user (depending on which edition is used) and Infobright can
be accessed through the same clients used for a regular MySQL instance.
Infobright comes in two flavours. The Community Edition (ICE) is Open Source Software and the
Enterprise Edition (IEE) is a commercial product. The chief differences between the two offerings are
support for data loading and DML (i.e. INSERT, UPDATE, DELETE).
We performed a limited evaluation to determine whether ICE would provide benefits in a real-life
We used data from a warehouse belonging to one of our clients and worked with them to
understand the analysis that they would like to perform but have so far been unable to.
The data and problem domain has been made anonymous and generic within this report to protect
The key principles for the evaluation were:
Use real data volumes
Ask real questions of the data
The aim of the evaluation was to understand how an Infobright Community Edition (ICE)
database compared with a standard MySQL database (using the InnoDB storage engine) over the
following:
User response times to sample queries
Storage space required by the database
Tests were performed on a developer’s desktop machine
Pentium Dual-core 2.16 GHz, 3 GB RAM, Windows XP Professional
MySQL Community Edition 5.1
o Using InnoDB
Infobright Community Edition 3.3.1
HeidiSQL was used to run the queries
Approximately a year’s worth of historical data was loaded into both databases. This equated
to 1.3 million rows.
The time taken to load the databases was not compared, as ICE only allows loading from a flat
file [3], although for reference it took 1 minute 29 seconds to load the data into ICE.
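Because ICE is populated from a flat file, source rows have to be exported to a delimited text file before loading. The following is a minimal sketch of that export step using Python's csv module. The rows, the semicolon delimiter and the file name are assumptions made for illustration, and the load itself (typically a bulk-load statement such as LOAD DATA INFILE in MySQL-family databases) is not shown, as the options ICE accepts were not part of this evaluation.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for records extracted from a source warehouse.
rows = [
    (1, "Acme", "2010-03-01", 40),
    (2, "Bolt", "2010-02-20", 70),
]


def write_flat_file(rows, path, delimiter=";"):
    """Write rows to a delimited ASCII text file and return the row count."""
    with open(path, "w", newline="", encoding="ascii") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)
    return len(rows)


path = os.path.join(tempfile.mkdtemp(), "sales.txt")
written = write_flat_file(rows, path)

# Read the file back to confirm what a bulk loader would see.
with open(path, encoding="ascii") as handle:
    lines = handle.read().splitlines()
print(written, lines)
```

The delimiter should be one that cannot appear in the data, or the bulk loader must be told how fields are quoted.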
Test 1: Comparing storage requirements
In this case, the same data was loaded into both databases but the InnoDB database had no
optimisations applied (i.e. no keys, indexes, aggregates, etc.). This was in order to limit the space
measured to only the data.
Test 2: Comparing response times
The second test was to compare the performance of an ICE database against that of an optimised
InnoDB database. The InnoDB database could not be optimised for every query that might be run
against it (as they are ad-hoc), so it was optimised only for the selected queries.
[3] IEE allows population through more means (e.g. using DML, binary dumps rather than ASCII). See
more at http://bit.ly/aXQvKM
Test 1: ICE compared to a non-optimised InnoDB database
Infobright needed 17.7 MB to store the 1,291,062 rows versus 203.8 MB needed by InnoDB.
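To put these two figures in proportion, the arithmetic can be worked through as below. Only the two sizes come from the measurement above; the variable names are illustrative.

```python
# Storage figures reported for Test 1 (same data in both databases).
infobright_mb = 17.7
innodb_mb = 203.8

# How many times more space the non-optimised InnoDB database needed.
ratio = innodb_mb / infobright_mb

# Space saved by Infobright, as a percentage of the InnoDB size.
saving_pct = (1 - infobright_mb / innodb_mb) * 100

print(f"InnoDB used {ratio:.1f} times the space of Infobright")
print(f"Infobright saved {saving_pct:.0f}% of the InnoDB storage")
```

In other words, the non-optimised InnoDB database needed roughly eleven and a half times the space, before any indexes or aggregate tables were added to it.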
Query                                              Infobright    InnoDB     x
top 10 customers by quantity                            3.828   147.781    39
top 10 customers by revenue                             7.734   124.703    16
top 10 customers with revenue between 300K and          8.109   160.094    20
top 10 customers by quantity between Jan and Apr        1.235    21.703    18
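The x column is the InnoDB response time divided by the Infobright response time, rounded to the nearest whole number. A short sketch reproducing it from the reported timings follows; the dictionary layout is only for illustration, and the third query label is kept exactly as it appears in the table.

```python
# Reported response times per query: (Infobright, InnoDB).
timings = {
    "top 10 customers by quantity": (3.828, 147.781),
    "top 10 customers by revenue": (7.734, 124.703),
    "top 10 customers with revenue between 300K and": (8.109, 160.094),
    "top 10 customers by quantity between Jan and Apr": (1.235, 21.703),
}

# Speed-up factor: how many times longer InnoDB took than Infobright.
speedups = {
    query: round(innodb / infobright)
    for query, (infobright, innodb) in timings.items()
}

for query, factor in speedups.items():
    print(f"{query}: {factor}x")
```

Across these four queries the factor ranges from 16 to 39.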
Test 2: ICE compared to an optimised InnoDB database
About the author