Big Data Glossary of terms

Glossary of Terms
Term Definition Significance
10GbE (Ethernet)
Networking
An Ethernet networking standard that supports the
transmission of data at a rate of up to 10
gigabits (10bn bits) per second
Kognitio unifies the resources of multiple nodes and
randomly distributes the data, so query execution
makes heavy use of the network. The higher the
network bandwidth the better: (dual) 10GbE, rather
than the more commonly available 1GbE, is our
preferred standard
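As a rough illustration of why bandwidth matters, the Python sketch below computes the ideal time to move a volume of data over a 1GbE and a 10GbE link. It assumes the full line rate is achieved and ignores protocol overhead and contention. The function name transfer_seconds and the 1 TB volume are illustrative choices, not Kognitio figures.

```python
def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal time to move num_bytes over a link rated at link_gbps gigabits per second."""
    bits = num_bytes * 8
    return bits / (link_gbps * 1e9)


one_terabyte = 10**12  # bytes (decimal terabyte), an illustrative volume

t_1gbe = transfer_seconds(one_terabyte, 1)    # 8000 seconds at line rate
t_10gbe = transfer_seconds(one_terabyte, 10)  # 800 seconds at line rate

print(f"1GbE:  {t_1gbe:.0f} s")
print(f"10GbE: {t_10gbe:.0f} s")
```

Real transfers will be slower than these figures, but the tenfold ratio between the two link speeds is the point of the comparison.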
ACID ACID (Atomicity, Consistency, Isolation,
Durability) is a set of properties that guarantee a
database transaction is processed reliably. For
example, a transfer of funds from one bank
account to another must either complete in full
or not happen at all
Kognitio is ACID compliant. As a result, even though it
has been designed to carry out analytical workloads,
it can also carry out transactional workloads
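The funds transfer example can be sketched in Python using the standard library's sqlite3 module. This is an illustration of atomicity in general, not of Kognitio itself: the accounts table, the names and the transfer helper are all hypothetical. The credit and the debit are wrapped in one transaction, so a failed debit also undoes the credit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing changes
print(ok, failed, balances(conn))
```

After the second call the balances are unchanged from the first transfer, which is the 'all or nothing' behaviour that atomicity describes.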
Amazon Web
Services (AWS)
A provider of public cloud infrastructure as a
service (IaaS), enabling the provisioning and
hardware management of appliances on-
demand based on an hourly charge
Enables applications to be considered that were not
previously possible by increasing flexibility and
considerably reducing short term costs and need for
capital expenditure
Analytical Platform A database platform that is specifically designed
and built to manage analytical workloads rather
than transactional workloads
Kognitio provides a scalable analytical platform to
support complex analytical applications
Analytical
Workloads
An analytical, as opposed to transactional,
workload is one associated with the reporting
and analysis of information. Typically analytical
workloads will involve a relatively small number
(compared with transactional workloads) of
querying tasks on all or large subsets of the
entire data set. As such, query performance is
essential
Kognitio has been designed to support analytical
rather than transactional workloads
Blade Servers A small form factor of server that enables high
density compute power. Units do not carry their
own power supply, cooling, networking, etc. so
cannot be run independently of blade
enclosures
Kognitio provides high performance computing and
requires a number of servers (scale-out) to achieve
this. As the performance is achieved through holding
data in RAM rather than on disk, compute density is
essential
Blade Enclosures Supplies the power, cooling, networking, etc. for
blade servers. Can contain several blades to
provide high density compute power
Kognitio benefits from the compute density offered by
the blade server form factor.
Cores Each core is an independent processing unit
(CPU). CPU chips now include multiple cores that
are capable of processing multiple tasks in
parallel
Multiple cores facilitate the parallel processing of
data which is a key driver of Kognitio’s performance.
Kognitio can drive cores at 100% as part of providing
linear scalability
CPU ‘Central Processing Unit’ – the area of the
computer that executes instructions and
processes data
CPUs/cores are the driver of Kognitio’s performance
capabilities
Cube The name given to a multidimensional (hence
‘cube’) structure built within an OLAP engine
Cubes can be designed and published, without
building, within the MDX designer associated with
Kognitio
Data Warehouse A central repository of information, created by
integrating data from one or more source
systems, that is used to support reporting and
analysis within an organisation
Kognitio’s target markets are closely associated with
data warehousing
Database Appliance A group of servers/nodes that are combined to
form a pre-built and pre-configured MPP
database environment that can be used ‘out of
the box’
Appliances have an advantage over software as they
can be brought into service quickly, ensuring a faster
return on investment. Kognitio can be delivered as an
appliance
Dimension A group of related attributes, typically defined in
one or more hierarchies, that enable the filtering
and grouping of associated measures in a data
warehouse
Data warehousing is a key application within
Kognitio’s target markets
Disk Data storage device – the common format for
storing data for processing. Typically a Hard Disk
Drive (HDD) but may be a Solid State Drive (SSD)
Provides a persistence layer for data held within or
associated with a Kognitio instance. As Kognitio
usually provides multiple disks in an appliance, RAID
methods can be used to improve resilience
Elastic Block Store
(EBS)
An area of persistent block storage available on
AWS infrastructure that can be attached to a
server. Typically used in database applications
Provides the facility to persist a Kognitio platform thus
enabling instances to be stopped and restarted which
considerably reduces on-demand infrastructure costs
ETL (Extract
Transform and Load)
A process for taking data from operational
systems, transforming it into information by
applying pre-defined processes to provide
context and loading it into, typically, a data
warehouse environment. A class of tools, such as
Informatica, has grown up to provide
sophisticated capabilities to carry out this
functionality.
This is a standard process within the data
warehousing space, an area that is closely associated
with Kognitio’s target markets. Tools such as
Informatica (a Kognitio strategic partner) work
effectively with Kognitio.
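The three ETL stages can be sketched in a few lines of Python. This is a toy pipeline using only the standard library, not a substitute for a tool such as Informatica. The RAW sample data, the orders table and the extract, transform and load function names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical operational export with inconsistent spacing and casing.
RAW = "order_id,amount,currency\n1, 10.50 ,gbp\n2,3.00,GBP\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply pre-defined rules (typing, trimming, normalised currency codes)."""
    return [
        (int(r["order_id"]), float(r["amount"].strip()), r["currency"].strip().upper())
        for r in rows
    ]


def load(conn, rows):
    """Load: insert the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW)))
print(warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```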
External Scripting A Kognitio version 8 capability that enables any
code capable of running under Linux to be
executed in parallel within a SQL framework on
the Kognitio Analytical Platform. Examples
include R, Python and Perl.
Enables very high performance execution of complex
analytical processes by removing the bottlenecks
traditionally associated with this workload, such as
moving the data to a single application server for
processing. Note that some processes cannot be
parallelized and, as such, will not be accelerated
External Tables A Kognitio version 8 capability that enables a
table to be mapped onto an external data source
before pulling the data into RAM. Each data
source requires a connector to be defined, with
initial connectors provided for Hadoop, S3 and
other Kognitio instances
Provides a very flexible and powerful way to access
external data sources without the need for ETL tools
or scripting
Flash Memory/SSD
(Solid State Disks)
SSDs use flash memory to provide relatively
faster access (than Hard Disk Drives) to
persistent data without using moving parts
(spinning disks/heads). Unlike RAM, data is
preserved after power loss. Access is still
considerably slower than RAM. As such, SSDs are
NOT a direct replacement for RAM
Kognitio’s disk based environment can benefit from
the provision of SSDs. However, as SSDs are generally
considerably more expensive than HDDs, it is
recommended that systems employ RAM rather than
SSDs as this will provide significantly greater
performance benefits. Disk based competitors
generally benefit more from the inclusion of SSDs

Hyperthreading Intel’s technology solution for increasing the
parallelization capabilities of CPU cores. Each
hyperthread is ‘seen’, by operating systems that
support hyperthreading, as a separate core,
enabling the workload to be shared between
them
Kognitio can effectively utilise hyperthreading to
increase the parallelization of processing, thus
enhancing performance and throughput
In-memory database A database specifically designed to operate
within RAM rather than one that is designed for
disk and utilises RAM to process data retrieved
from disk blocks (caching)
Kognitio has its roots as an in-memory database and
gets its performance by storing data in RAM. This has
an advantage over caching: with a cache, if specific
data values or query results are not available, there
is a ‘cache miss’, which results in further (expensive)
disk reads to acquire the data
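The cost of cache misses can be illustrated with a toy model in Python. The class below is a hypothetical disk-based store fronted by a small least-recently-used cache; it simply counts how often the 'disk' has to be read. It is a sketch of caching behaviour in general and does not represent any particular product.

```python
from collections import OrderedDict


class CachedDiskStore:
    """Toy disk-based store fronted by a small least-recently-used cache."""

    def __init__(self, disk, capacity):
        self.disk = disk            # stands in for data held on disk
        self.capacity = capacity    # number of entries the cache can hold
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:       # cache hit: no disk access needed
            self.cache.move_to_end(key)
            return self.cache[key]
        self.disk_reads += 1        # cache miss: fall back to a disk read
        value = self.disk[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value


def scan_twice(store, keys):
    """Read every key twice in order and report how many disk reads were needed."""
    for _ in range(2):
        for key in keys:
            store.get(key)
    return store.disk_reads


disk = {"a": 1, "b": 2, "c": 3, "d": 4}
small_cache_reads = scan_twice(CachedDiskStore(disk, capacity=2), list(disk))  # 8
full_cache_reads = scan_twice(CachedDiskStore(disk, capacity=4), list(disk))   # 4
print(small_cache_reads, full_cache_reads)
```

When the cache is smaller than the data being scanned, every access in this example misses; once all the data fits in memory, only the initial load touches the disk.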
JDBC Java DataBase Connectivity is a standard API for
accessing relational database management
systems (RDBMS) for the Java programming
language
Kognitio supports the JDBC standard via a JDBC to
ODBC bridge provided by Simba Technologies
Latency Time delay between initiating a request and any
actions associated with the request being
completed. Typically this will be the time taken
for a query to run. However, it could also be
associated with disk access times, load times,
network transmission and time to insight
Kognitio holds data in RAM to make sure that it is as
close to the CPUs as possible, thus reducing the
latency associated with moving data and reducing
query times. In many use cases, it may not be
necessary to write to disk, thus reducing latency
associated with data loading. Time to insight is also
key to the value proposition associated with the
Kognitio analytical platform
Linear scalability The capability to improve performance in line
with system size. For example, doubling the
power of a system will result in the same query
time on twice the volume of data (NOTE: this is
not the same as saying that doubling the power
halves the query time on the same data)
As Kognitio has focused on reducing bottlenecks, it
provides linear scalability for both query and bulk load
performance (insert rather than update – referential
integrity has a significant impact)
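The distinction drawn in the note can be shown with a simple timing model in Python. The model is an assumption made purely for illustration: query time is a small fixed overhead plus the data volume divided by the aggregate scan rate of the nodes. The throughput and overhead figures are invented, not measured Kognitio values.

```python
def query_seconds(data_gb, nodes, gb_per_node_per_sec=5.0, fixed_overhead_sec=1.0):
    """Hypothetical model: fixed overhead plus a scan spread evenly across nodes."""
    return fixed_overhead_sec + data_gb / (nodes * gb_per_node_per_sec)


base = query_seconds(100, 4)       # 1 + 100 / 20 = 6.0 seconds
scaled = query_seconds(200, 8)     # 1 + 200 / 40 = 6.0 seconds: same time, twice the data
same_data = query_seconds(100, 8)  # 1 + 100 / 40 = 3.5 seconds: better, but not half

print(base, scaled, same_data)
```

In this model, doubling both the nodes and the data leaves the query time unchanged, while doubling the nodes on the same data falls short of halving it because the fixed overhead does not shrink.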
Massively Parallel
Processing (MPP)
Parallel processing on a large scale, typically
achieved through combining the processing
capabilities of a number of nodes
Kognitio combines the compute power of multiple
nodes and CPUs to provide MPP capabilities to
analytical workloads
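The general MPP pattern of distributing rows across nodes, computing partial results on each node and then combining them can be sketched in Python. The 'nodes' here are simply lists within one process and the helper names are hypothetical. Python threads stand in for separate servers, so the sketch shows the shape of the computation rather than any real speed-up.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_nodes, seed=0):
    """Assign each row to a randomly chosen node (seeded for repeatability)."""
    rng = random.Random(seed)
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[rng.randrange(num_nodes)].append(row)
    return nodes


def parallel_sum(nodes):
    """Each node sums its own rows; the partial sums are then combined."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(sum, nodes))
    return sum(partials)


nodes = distribute(list(range(1000)), num_nodes=4)
total = parallel_sum(nodes)
print([len(n) for n in nodes], total)
```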
MDX
(MultiDimensional
eXpressions)
MDX is a language developed by Microsoft to
enable querying of multidimensional data stores
(OLAP) in much the same way that Structured
Query Language (SQL) is used for relational data
stores.
MDX is a supported language for querying the
Kognitio Analytical Platform. It requires that a model
is in place that defines the relationships between
dimension and fact tables and a provider that
converts the MDX code into SQL. A tool to design and
build the model is available to Kognitio
Measure In data warehousing, a measure is a property
that can be aggregated (sum, count, average,
etc.). For example, the number of units for a
product in a retail basket is a measure.
Data warehousing is a key application in Kognitio’s
markets
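The relationship between a measure and a dimension can be illustrated with a small SQL aggregation, run here through Python's sqlite3 module. The basket_lines table and its contents are invented for the example: units is the measure being aggregated and product_group is the dimension attribute used for grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basket_lines (product_group TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO basket_lines VALUES (?, ?, ?)",
    [
        ("Dairy", "Milk", 2),
        ("Dairy", "Cheese", 1),
        ("Bakery", "Bread", 3),
        ("Dairy", "Milk", 1),
    ],
)

# Aggregate the 'units' measure, grouped by the 'product_group' dimension attribute.
units_by_group = conn.execute(
    "SELECT product_group, SUM(units), COUNT(*) "
    "FROM basket_lines GROUP BY product_group ORDER BY product_group"
).fetchall()

print(units_by_group)
```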
Memory (RAM) Random Access Memory (RAM) is referred to
simply as ‘memory’ by Kognitio and is a form of
memory that provides random access to data.
Data does not persist in RAM when power is lost
Kognitio is an in-memory (RAM) analytical platform.
As such Kognitio gains its performance advantage over
disk based environments when tables or images are
stored in RAM as the data is kept close to the CPUs to
reduce query and loading latency

Node A modular unit of an MPP architecture, equivalent
to a server (physical or virtual)
Nodes form the basic units for constructing a Kognitio
MPP instance
NoSQL Databases Originally indicating that SQL was not used to
query the environment, this has since been
modified to become “Not Only SQL”. NoSQL
databases were designed to handle Big Data
‘volumes, velocities and varieties’ and, as such,
tend to provide less rigorous integrity and
metadata handling than relational database
management systems. Built for scale out, they
are schema-less and ‘eventually consistent’
(BASE) rather than ACID compliant.
Kognitio is NOT a NoSQL database but is incorporating
additional scripting languages to provide NoSQL
capabilities. For business intelligence and ‘repeatable’
analytics on a defined dataset, a schema is considered
to be a positive asset
ODBC Open DataBase Connectivity is a standard API for
accessing relational database management
systems (RDBMS)
ODBC is the standard approach for connecting to a
Kognitio instance. The majority of BI tools will support
generic ODBC connectivity and, hence, will likely be
able to connect to a Kognitio instance. The exceptions
tend to be OLAP clients, which will typically connect
via ODBO or XML/A, or tools that utilise JDBC or REST
interfaces
ODBO OLE DB for OLAP is a Microsoft published
standard mechanism for connecting to OLAP
data sources via the MDX language. OLAP
sources and clients may only adopt part of the
standard which can lead to connectivity and
processing issues. ODBO is a two tier
architecture (client and server)
Kognitio, via its partner Simba, has an MDX provider
interface that can support ODBO connectivity.
However, note that not all OLAP clients may
necessarily be supported owing to the variability with
which tools have incorporated the standard.
OLAP OnLine Analytical Processing is a representation
of a business intelligence model suitable for
consumption by non-technical users. Typically
data would be stored in ‘cubes’ that contain
measures and hierarchical dimensions which are
logically grouped in the manner that businesses
reference them (e.g. a product hierarchy
consisting of product group, sub-group, family,
sub-family and product).
Traditional cubes would be pre-calculated at
intervals with aggregated measures stored at the
various levels and combinations of the
dimensions to facilitate very fast access. The
cubes would be accessed by purpose built clients
and, typically, by the specially defined MDX
language
Kognitio provides the facility to view the Analytical
Platform via an OLAP model utilising connectivity
software provided by Simba Technologies. Rather
than pre-calculating OLAP cubes Kognitio utilises the
performance characteristics of the platform to
provide virtual cubes which eliminates the lengthy
build times associated with OLAP
OLTP (OnLine
Transaction
Processing)
A class of system designed to manage
transaction oriented workloads. An OLTP
database will be specifically designed to manage
data entered, produced or processed by a
transactional system and, hence, is designed for
the rapid insertion and updating of records
within a table
Whilst Kognitio can support OLTP associated
workloads, it was designed for analytical workloads
and, hence, is suboptimal for OLTP environments

Parallel Processing The simultaneous use of more than one CPU or
core to execute a program. Operations that can
be performed in parallel will execute faster
within a parallel computing framework
(potentially proportionate to the number of
cores/CPUs available). The overall effectiveness
of the parallelism may be limited by tasks that
are executed serially
Kognitio has a strong parallel architecture and
achieves its performance through parallelism across
multiple nodes, multiple CPUs and associated cores.
This enables Kognitio to provide linear scalability in
line with increasing memory size and core counts
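The limit imposed by serially executed tasks is commonly described by Amdahl's law, which is not named in this glossary but captures the same idea. The Python sketch below computes the theoretical speed-up for a workload in which a given fraction can be parallelised; the function name and the example fractions are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speed-up when only parallel_fraction of the work can use all cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


fully_parallel = amdahl_speedup(1.0, 8)   # 8.0: speed-up proportional to core count
mostly_parallel = amdahl_speedup(0.9, 8)  # about 4.7: the 10% serial part dominates
many_cores = amdahl_speedup(0.9, 10**6)   # approaches, but never reaches, 10

print(fully_parallel, round(mostly_parallel, 2), round(many_cores, 2))
```

With 10% of the work executed serially, no number of cores can deliver more than a tenfold speed-up, which is why reducing serial bottlenecks matters as much as adding cores.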
Persistence Layer An area provided to ensure that data is
maintained when a server/appliance is powered
down, typically hard disk based.
Data in RAM does not persist when the hardware is
powered down so if data is required to persist it
should be stored within this layer. For physical devices
this will typically be local disk based. However, for
AWS based instances this has to be managed in a
different way as local storage is ephemeral, meaning
that the disk drives are wiped when a server is
terminated. EBS or S3 storage are typically used to
provide persistence in AWS
Private Cloud Provision of non-publicly available infrastructure
on-demand – see public cloud. Private clouds
typically provide additional certified standards
compared with public clouds
Kognitio provides its own infrastructure to clients,
which is referred to as a ‘private cloud’. Provisioning is
done on a term basis rather than on-demand, and the
infrastructure is maintained offsite by Kognitio or its
partners for customers’ use rather than on-premise.
This enables customers to get environments up and
running quickly without up-front capital expenditure
Public Cloud The publicly available provisioning of shared
computing infrastructure. Typically this is
achieved through virtualization and is generally
provided on-demand with no upfront capital
costs
Enables Kognitio to provide access to a pre-configured
appliance on-demand rather than in days (private
cloud) or weeks/months (on-premise appliance).
Kognitio uses Amazon Web Services (AWS) to provide
this facility but, in principle, any provider could be
used
R language R is a software environment, with its own
language, for statistical computation and
graphics. It is open source and has grown to
become a standard for statistical processing with
particularly high penetration in the academic
world and, increasingly, the data science
community
Kognitio has recently added support for the R
language via the external scripting capability in v8
RAID Redundant Array of Inexpensive/Independent
Disks is a storage mechanism to combine
multiple disks into a single logical unit. Data is
distributed across the disks for the purposes of
improved performance or resilience. There are
several levels of RAID available which provide
different performance and resilience
characteristics.
A Kognitio appliance uses RAID 1 (mirroring) to ensure
that the appliance does not lose data should a node
become unavailable.
Racks Physical frameworks for holding an array of
servers or blade enclosures specifically designed
to be mounted within the framework
Kognitio appliances utilise racks

Rackmounts Independent, fully self-contained servers. These
servers are generally larger than blades and can
have more RAM, CPU, Disk, etc. Whilst they are
typically housed in a rack, rackmounts provide
flexibility over blades in that a limited number
(up to three practically) can be stacked
independently (with switching) to form an
appliance without the need for rack
infrastructure
Kognitio appliances can be based on rackmounts as
well as blade servers. For certain applications,
rackmounts can provide cost advantages over blade
servers (e.g. small appliances up to 768GB RAM)
RAM Only
Temporary Tables
(ROTT)
A table in Kognitio RAM with no associated
storage (protection) in the persistence layer.
This means that, whilst the structure is
persistent, the data is ephemeral
ROTTs are used for non-persistent workloads. For
example, they provide the highest potential load
speeds for data that needs to be processed before it is
persisted. The alternative, tables and table images,
would involve writing to disk with the resultant delay.
Failure of an appliance will result in loss of data held
in ROTTs
Referential Integrity This is the process of ensuring that the data
entered in a column is valid. For example, in a
relational table, a column may be specified as a
foreign key (i.e. the data must exist in another
table) in which case, at data load time, this
constraint will be checked before the data is
entered. Failure of the constraint will result in
the data not being entered
Kognitio is fully ACID compliant and supports
referential integrity. However, tables need to be in
RAM to perform this task and the process has a
severe impact on load performance since it results in a
full table scan for each referential integrity check.
Careful consideration needs to be given to application
design implications
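A foreign key check at load time can be demonstrated with Python's sqlite3 module. This is a generic illustration using SQLite, not Kognitio syntax; the products and order_lines tables and the try_insert helper are hypothetical. A row that refers to an existing product is accepted, while a row that refers to a missing product is rejected and not entered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE order_lines ("
    "line_id INTEGER PRIMARY KEY, "
    "product_id INTEGER NOT NULL REFERENCES products(product_id))"
)
conn.execute("INSERT INTO products VALUES (1, 'Bread')")
conn.commit()


def try_insert(line_id, product_id):
    """Attempt to load one row; return False if the foreign key constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO order_lines VALUES (?, ?)", (line_id, product_id))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(1, 1)   # product 1 exists, so the row is entered
rejected = try_insert(2, 99)  # product 99 does not exist, so the row is refused
loaded = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
print(accepted, rejected, loaded)
```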
Scale-up To increase the size of a server through the
addition of new resources (CPU/memory)
Many databases can only utilise single servers, so the
ability to incorporate greater resources is necessary
for them to address larger data sets. However, there
are cost implications to scaling-up (e.g. larger memory
DIMMs tend to be considerably more expensive) and
limitations to the data set sizes that can be addressed.
Kognitio can fully utilise the resources available in a
scaled-up environment
Scale-out To increase the size of an appliance through the
addition of more nodes
Whilst utilising scaling-up to increase data sizes
addressable by databases is common, it is less
common to be able to do this by scaling-out. Kognitio
addresses larger data sizes through scaling-out. It can
often involve less capital outlay to have several
smaller nodes than one very large server and the size
limitations of a single server are removed
S3 (Simple Storage
Service)
A cost effective, secure and highly available file
storage area available on AWS cloud
infrastructure.
Provides the facility to stage data files ready to load
into a Kognitio Analytical Platform. Also provides an
environment to store readily available backups and
associated files. Kognitio, in v8, has a connector that
can map external tables onto S3 and load the entire
file into RAM
SQL (Structured
Query Language)
SQL is a language designed to manage and query
data held in a relational data store
SQL is the standard used for querying the Kognitio
Analytical Platform.

Switch A network switch enables the linking of multiple
network devices
Kognitio appliances require the cooperative
processing of multiple nodes. As such, switches are
required to facilitate the flow of data/message
passing between nodes. Note: for appliances
involving two nodes, no switching is required as the
nodes are linked peer to peer.
Table Image A Kognitio table that is simultaneously available
in RAM and on disk. The table may be
completely or partially (only selected columns or
rows) represented in RAM
Table images enable both performant queries and
persistence
Time to Insight The time taken from the point at which the data
of interest is generated in an operational system
to the point at which it has been analysed. This
involves several aspects:
• Volume of data
• Velocity of data
• Network speed
• Need to move data
• Load speed
• Query speed
Kognitio has the ability to ingest, and not just query,
data very quickly. As such, Kognitio’s time to insight is
considerably lower than that of many competing
products, such as those that rely on accelerative
structures (OLAP, indexes, columnar) to provide
acceptable query performance, since building these
structures impacts load speed
Transactional
Workloads
A transactional, as opposed to analytical,
workload is one that involves a large number
(compared to analytical workloads) of small
processes that may involve locating, inserting,
updating or deleting rather than querying data.
Transaction speed and referential integrity are
critical to this workload
Whilst Kognitio can support transactional workloads,
it has been designed to manage analytical workloads.
As such, for transactional environments, it is highly
likely that OLTP databases will more appropriately
fulfil the requirement
View Image An in-memory instantiation (copy of results) of a
view in Kognitio. At the point of instantiation,
processing (such as joins, GROUP BYs, etc.)
associated with the view is undertaken and the
results physically stored in RAM
View images considerably enhance the performance
of queries where the views are used repeatedly as the
processing in the view only needs to be carried out
once. Allows different representations of common
underlying data
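The idea of instantiating a view can be approximated in standard SQL by running the view once and storing its result set. The Python sketch below does this with sqlite3; it is an analogy only and does not use Kognitio's own syntax for view images. The sales table, the view and the image table name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 10), ("South", 5), ("North", 7)],
)

# A plain view stores only the query; the GROUP BY runs every time it is read.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Instantiating the view: run the query once and keep the results as stored rows.
conn.execute("CREATE TABLE sales_by_region_image AS SELECT * FROM sales_by_region")

# Later queries read the stored results without repeating the aggregation.
image_rows = conn.execute(
    "SELECT region, total FROM sales_by_region_image ORDER BY region"
).fetchall()
print(image_rows)
```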
XML/A XML for Analysis is a published standard
mechanism for connecting to analytical data
sources such as OLAP (via the MDX language)
and data mining. XML/A is a three tier
architecture (client, mid-tier and server) enabling
the caching of results to be incorporated which
can considerably increase the speed of satisfying
common user community queries
Kognitio, via its Simba Technologies developed MDX
provider, can support XML/A connectivity to OLAP
objects. However, note that not all OLAP clients may
necessarily be supported owing to the variability with
which tools have incorporated the standard.
Kognitio’s implementation incorporates a caching tier
that can enhance query and concurrency
performance.
