Big Data Glossary of terms


A simple glossary of some of the important phrases to know about Big Data and Advanced Analytics


Glossary of Terms

Term: 10GbE (Ethernet) Networking
Definition: Network cabling capable of supporting the transmission of data at a rate of up to 10 gigabits (10bn bits) per second.
Significance: As Kognitio unifies the resources of multiple nodes and randomly distributes the data, heavy use is made of networking in the execution of queries, so the higher the network bandwidth the better, with (dual) 10GbE, as opposed to the more commonly available 1GbE, being our preferred standard.

Term: ACID
Definition: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee a database transaction is processed reliably. For example, a transfer of funds from one bank account to another.
Significance: Kognitio is ACID compliant. As a result, even though it has been designed to carry out analytical workloads, it can also carry out transactional workloads.

Term: Amazon Web Services (AWS)
Definition: A provider of public cloud infrastructure as a service (IaaS), enabling the provisioning and hardware management of appliances on-demand based on an hourly charge.
Significance: Enables applications to be considered that were not previously possible by increasing flexibility and considerably reducing short term costs and need for capital expenditure.

Term: Analytical Platform
Definition: A database platform that is specifically designed and built to manage analytical workloads rather than transactional workloads.
Significance: Kognitio provides a scalable analytical platform to support complex analytical applications.

Term: Analytical Workloads
Definition: An analytical, as opposed to transactional, workload is one associated with the reporting and analysis of information. Typically analytical workloads will involve a relatively small number (compared with transactional workloads) of querying tasks on all or large subsets of the entire data set. As such, query performance is essential.
Significance: Kognitio has been designed to support analytical rather than transactional workloads.

Term: Blade Servers
Definition: A small form factor of server that enables high density compute power. Units do not carry their own power supply, cooling, networking, etc. so cannot be run independently of blade enclosures.
Significance: Kognitio provides high performance computing and requires a number of servers (scale-out) to achieve this. As the performance is achieved through holding data in RAM rather than on disk, compute density is essential.

Term: Blade Enclosures
Definition: Supplies the power, cooling, networking, etc. for blade servers. Can contain several blades to provide high density compute power.
Significance: Kognitio benefits from the compute density offered by the blade server form factor.

Term: Cores
Definition: Each core is an independent processing unit (CPU). CPU chips now include multiple cores that are capable of processing multiple tasks in parallel.
Significance: Multiple cores facilitate the parallel processing of data which is a key driver of Kognitio’s performance. Kognitio can drive cores at 100% as part of providing linear scalability.

Term: CPU
Definition: ‘Central Processing Unit’: the area of the computer that executes instructions and processes.
Significance: CPUs/cores are the driver of Kognitio’s performance capabilities.

Term: Cube
Definition: The name given to a multidimensional (hence ‘cube’) structure built within an OLAP engine.
Significance: Cubes can be designed and published, without building, within the MDX designer associated with Kognitio.

Term: Data Warehouse
Definition: A central repository of information, created by integrating data from one or more source systems, that is used to support reporting and analysis within an organisation.
Significance: Kognitio’s target markets are closely associated with data warehousing.

Term: Database Appliance
Definition: A group of servers/nodes that are combined to form a pre-built and pre-configured MPP database environment that can be used ‘out of the box’.
Significance: Appliances have an advantage over software as they can be brought into service quickly, ensuring a faster return on investment. Kognitio can be delivered as an appliance.

Term: Dimension
Definition: A group of related attributes, typically defined in one or more hierarchies, that enable the filtering and grouping of associated measures in a data warehouse.
Significance: Data warehousing is a key application within Kognitio’s target markets.

Term: Disk
Definition: Data storage device: the common format for storing data for processing. Typically a Hard Disk Drive (HDD) but may be a Solid State Drive (SSD).
Significance: Provides a persistence layer for data held within or associated with a Kognitio instance. As Kognitio usually provides multiple disks in an appliance, RAID methods can be used to improve resilience.

Term: Elastic Block Store (EBS)
Definition: An area of persistent block storage available on AWS infrastructure that can be attached to a server. Typically used in database applications.
Significance: Provides the facility to persist a Kognitio platform, thus enabling instances to be stopped and restarted, which considerably reduces on-demand infrastructure costs.

Term: ETL (Extract Transform and Load)
Definition: A process for taking data from operational systems, transforming it into information by applying pre-defined processes to provide context and loading it into, typically, a data warehouse environment. A class of tools, such as Informatica, has grown up to provide sophisticated capabilities to carry out this functionality.
Significance: This is a standard process within the data warehousing space, an area that is closely associated with Kognitio’s target markets. Tools such as Informatica (a Kognitio strategic partner) work effectively with Kognitio.

Term: External Scripting
Definition: A Kognitio version 8 capability that enables any code capable of running under Linux to be executed in parallel within a SQL framework on the Kognitio Analytical Platform. Examples include R, Python and Perl.
Significance: Enables very high performance execution of complex analytical processes by removing the bottlenecks traditionally associated with this workload, such as moving the data to a single application server for processing. Note that some processes cannot be parallelized and, as such, will not be accelerated.

Term: External Tables
Definition: A Kognitio version 8 capability that enables a table to be mapped onto an external data source before pulling the data into RAM. Each data source requires a connector to be defined, with initial connectors provided for Hadoop, S3 and other Kognitio instances.
Significance: Provides a very flexible and powerful way to access external data sources without the need for ETL tools or scripting.

Term: Flash Memory/SSD (Solid State Disks)
Definition: SSDs use flash memory to provide relatively faster access (than Hard Disk Drives) to persistent data without using moving parts (spinning disks/heads). Unlike RAM, data is preserved after power loss. Access is still considerably slower than RAM. As such, SSDs are NOT a direct replacement for RAM.
Significance: Kognitio’s disk based environment can benefit from the provision of SSDs. However, as SSDs are generally considerably more expensive than HDDs, it is recommended that systems employ RAM rather than SSDs as this will provide significantly greater performance benefits. Disk based competitors generally benefit more from the inclusion of SSDs.
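The ACID entry above gives a transfer of funds between bank accounts as its example. The following is a minimal sketch of the atomicity part of that guarantee, using Python's built-in `sqlite3` module as a stand-in (it is not Kognitio, and the `accounts` table, the account names and the `transfer` helper are all hypothetical). The credit is applied first so that, when the debit is rejected, the already-applied credit has to be rolled back.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; return True if committed."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back every statement in it if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def demo_transfer():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    ok = transfer(conn, "alice", "bob", 30)       # valid transfer
    failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

After both calls the balances reflect only the first transfer: the rejected transfer leaves no partial credit behind.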

Term: Hyperthreading
Definition: Intel’s technology solution for increasing the parallelization capabilities of CPU cores. Each hyperthread is ‘seen’, by operating systems that support hyperthreading, as a separate core, enabling the workload to be shared between them.
Significance: Kognitio can effectively utilise hyperthreading to increase the parallelization of processing, thus enhancing performance and throughput.

Term: In-memory database
Definition: A database specifically designed to operate within RAM rather than one that is designed for disk and utilises RAM to process data retrieved from disk blocks (caching).
Significance: Kognitio has its roots as an in-memory database and gets its performance by storing data in RAM. This has advantages over caching in the fact that, if specific data values or query results are not available within the cache, there will be a ‘cache miss’ which will result in further (expensive) disk reads to acquire the data.

Term: JDBC
Definition: Java DataBase Connectivity is a standard API for accessing relational database management systems (RDBMS) for the Java programming language.
Significance: Kognitio supports the JDBC standard via a JDBC to ODBC bridge provided by Simba Technologies.

Term: Latency
Definition: Time delay between initiating a request and any actions associated with the request being completed. Typically this will be the time taken for a query to run. However, it could also be associated with disk access times, load times, network transmission and time to insight.
Significance: Kognitio holds data in RAM to make sure that it is as close to the CPUs as possible, thus reducing the latency associated with moving data and reducing query times. In many use cases, it may not be necessary to write to disk, thus reducing latency associated with data loading. Time to insight is also key to the value proposition associated with the Kognitio analytical platform.

Term: Linear scalability
Definition: The capability to improve performance in line with system size. For example, doubling the power of a system will result in the same query time on twice the volume of data (NOTE: this is not the same as doubling the power results in half the query time on the same data).
Significance: As Kognitio has focused on reducing bottlenecks, it provides linear scalability for both query and bulk load performance (insert rather than update; referential integrity has a significant impact).

Term: Massively Parallel Processing (MPP)
Definition: Parallel processing on a large scale, typically achieved through combining the processing capabilities of a number of nodes.
Significance: Kognitio combines the compute power of multiple nodes and CPUs to provide MPP capabilities to analytical workloads.

Term: MDX (MultiDimensional eXpressions)
Definition: MDX is a language developed by Microsoft to enable querying of multidimensional data stores (OLAP) in much the same way that Structured Query Language (SQL) is used for relational data stores.
Significance: MDX is a supported language for querying the Kognitio Analytical Platform. It requires that a model is in place that defines the relationships between dimension and fact tables and a provider that converts the MDX code into SQL. A tool to design and build the model is available to Kognitio.

Term: Measure
Definition: In data warehousing, a measure is a property that can be aggregated (sum, count, average, etc.). For example, the number of units for a product in a retail basket is a measure.
Significance: Data warehousing is a key application in Kognitio’s markets.

Term: Memory (RAM)
Definition: Random Access Memory (RAM) is referred to simply as ‘memory’ by Kognitio and is a form of memory that provides random access to data. Data does not persist in RAM when power is lost.
Significance: Kognitio is an in-memory (RAM) analytical platform. As such Kognitio gains its performance advantage over disk based environments when tables or images are stored in RAM as the data is kept close to the CPUs to reduce query and loading latency.
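The linear scalability entry above can be illustrated with a small piece of arithmetic. This is only an idealised sketch: it assumes a query is a full scan whose time is data volume divided by total throughput, with no serial bottleneck, and the function name, node counts and the 1 GB/s per-node rate are made-up figures rather than Kognitio measurements.

```python
def query_time_seconds(data_gb, nodes, gb_per_second_per_node=1.0):
    """Idealised scan time: volume divided by the combined node throughput."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    return data_gb / (nodes * gb_per_second_per_node)


# Doubling the system (4 -> 8 nodes) while doubling the data (100 -> 200 GB)
# leaves the query time unchanged under this model.
baseline = query_time_seconds(data_gb=100, nodes=4)
doubled = query_time_seconds(data_gb=200, nodes=8)
```

Both calls evaluate to 25.0 seconds. The model deliberately covers only the first statement in the definition; as the entry notes, halving the query time on the same data is a separate claim.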

Term: Node
Definition: A modular unit of an MPP architecture, i.e. a server (physical or virtual).
Significance: Nodes form the basic units for constructing a Kognitio MPP instance.

Term: NoSQL Databases
Definition: Originally indicating that SQL was not used to query the environment, this has since been modified to become “Not Only SQL”. NoSQL databases were designed to handle Big Data ‘volumes, velocities and varieties’ and, as such, tend to provide less rigorous integrity and metadata handling than relational database management systems. Built for scale out, they are schemaless and ‘eventually consistent’ (BASE) rather than ACID compliant.
Significance: Kognitio is NOT a NoSQL database but is incorporating additional scripting languages to provide NoSQL capabilities. For business intelligence and ‘repeatable’ analytics on a defined dataset, a schema is considered to be a positive asset.

Term: ODBC
Definition: Open DataBase Connectivity is a standard API for accessing relational database management systems (RDBMS).
Significance: ODBC is the standard approach for connecting to a Kognitio instance. The majority of BI tools will support generic ODBC connectivity and, hence, will likely be able to connect to a Kognitio instance. The exceptions tend to be OLAP clients, which will typically connect via ODBO or XML/A, or tools that utilise JDBC or REST interfaces.

Term: ODBO
Definition: OLE DB for OLAP is a Microsoft published standard mechanism for connecting to OLAP data sources via the MDX language. OLAP sources and clients may only adopt part of the standard, which can lead to connectivity and processing issues. ODBO is a two tier architecture (client and server).
Significance: Kognitio, via its partner Simba, has an MDX provider interface that can support ODBO connectivity. However, note that not all OLAP clients may necessarily be supported owing to the variability with which tools have incorporated the standard.

Term: OLAP
Definition: OnLine Analytical Processing is a representation of a business intelligence model suitable for consumption by non-technical users. Typically data would be stored in ‘cubes’ that contain measures and hierarchical dimensions which are logically grouped in the manner that businesses reference them (e.g. a product hierarchy consisting of product group, sub-group, family, sub-family and product). Traditional cubes would be pre-calculated at intervals with aggregated measures stored at the various levels and combinations of the dimensions to facilitate very fast access. The cubes would be accessed by purpose built clients and, typically, by the specially defined MDX language.
Significance: Kognitio provides the facility to view the Analytical Platform via an OLAP model utilising connectivity software provided by Simba Technologies. Rather than pre-calculating OLAP cubes, Kognitio utilises the performance characteristics of the platform to provide virtual cubes, which eliminates the lengthy build times associated with OLAP.

Term: OLTP (OnLine Transaction Processing)
Definition: A class of system designed to manage transaction oriented workloads. An OLTP database will be specifically designed to manage data entered, produced or processed by a transactional system and, hence, is designed for the rapid insertion and updating of records within a table.
Significance: Whilst Kognitio can support OLTP associated workloads, it was designed for analytical workloads and, hence, is suboptimal for OLTP environments.
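The OLAP entry above describes traditional cubes as aggregated measures pre-calculated at each level of a hierarchical dimension. A minimal sketch of that pre-calculation in plain Python follows. The three-level hierarchy (shortened from the five levels named in the entry), the `units` measure and the sample rows are all invented for illustration, and nothing here reflects how Kognitio's virtual cubes are implemented.

```python
from collections import defaultdict

# Hypothetical product hierarchy, from coarsest to finest level.
HIERARCHY = ("group", "family", "product")


def build_cube(rows, measure="units"):
    """Pre-calculate the summed measure at every level of the hierarchy.

    Keys are tuples of level values: () is the grand total,
    ("drinks",) a group total, ("drinks", "juice") a family total, etc.
    """
    cube = defaultdict(int)
    for row in rows:
        for depth in range(len(HIERARCHY) + 1):
            key = tuple(row[level] for level in HIERARCHY[:depth])
            cube[key] += row[measure]
    return dict(cube)


SALES = [
    {"group": "drinks", "family": "juice", "product": "apple", "units": 3},
    {"group": "drinks", "family": "juice", "product": "orange", "units": 2},
    {"group": "drinks", "family": "soda", "product": "cola", "units": 5},
    {"group": "snacks", "family": "crisps", "product": "salted", "units": 4},
]

cube = build_cube(SALES)
```

Once built, any level is a dictionary lookup, for example `cube[("drinks",)]` or `cube[("drinks", "juice")]`. The cost is the build step, which has to be repeated whenever the underlying rows change; that is the build time the entry says virtual cubes avoid.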

Term: Parallel Processing
Definition: The simultaneous use of more than one CPU or core to execute a program. Operations that can be performed in parallel will execute faster within a parallel computing framework (potentially proportionate to the number of cores/CPUs available). The overall effectiveness of the parallelism may be limited by tasks that are executed serially.
Significance: Kognitio has a strong parallel architecture and achieves its performance through parallelism across multiple nodes, multiple CPUs and associated cores. This enables Kognitio to provide linear scalability in line with increasing memory size and core counts.

Term: Persistence Layer
Definition: An area provided to ensure that data is maintained when a server/appliance is powered down, typically hard disk based.
Significance: Data in RAM does not persist when the hardware is powered down, so if data is required to persist it should be stored within this layer. For physical devices this will typically be local disk based. However, for AWS based instances this has to be managed in a different way as local storage is ephemeral, meaning that the disk drives are wiped when a server is terminated. EBS or S3 storage are typically used to provide persistence in AWS.

Term: Private Cloud
Definition: Provision of non-publicly available infrastructure on-demand (see public cloud). Private clouds typically provide additional certified standards compared with public clouds.
Significance: Kognitio provides its own infrastructure to clients, which is referred to as a ‘private cloud’. Provisioning is done on a term basis rather than on-demand but is maintained by Kognitio or its partners offsite for customer’s use rather than on-premise. Provides a facility to customers to get environments up and running quickly without up-front capital expenditure.

Term: Public Cloud
Definition: The publicly available provisioning of shared computing infrastructure. Typically this is achieved through virtualization and is generally provided on-demand with no upfront capital costs.
Significance: Enables Kognitio to provide access to a pre-configured appliance on-demand rather than in days (private cloud) or weeks/months (on-premise appliance). Kognitio uses Amazon Web Services (AWS) to provide this facility but, in principle, any provider could be used.

Term: R language
Definition: R is software and its associated syntax language for providing statistical computation and graphics. It is open source and has grown to become a standard for statistical processing with particularly high penetration in the academic world and, increasingly, the data science community.
Significance: Kognitio has recently added support for the R language via the external scripting capability in v8.

Term: RAID
Definition: Redundant Array of Inexpensive/Independent Disks is a storage mechanism to combine multiple disks into a single logical unit. Data is distributed across the disks for the purposes of improved performance or resilience. There are several levels of RAID available which provide different performance and resilience characteristics.
Significance: A Kognitio appliance uses RAID 1 (mirroring) to ensure that the appliance does not lose data should a node become unavailable.

Term: Racks
Definition: Physical frameworks for holding an array of servers or blade enclosures specifically designed to be mounted within the framework.
Significance: Kognitio appliances utilise racks.
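The Parallel Processing entry above notes that the benefit of parallelism may be limited by tasks that are executed serially. The glossary does not name a formula for this; one standard way to quantify it is Amdahl's law, sketched below. The function name and the example fractions are illustrative assumptions, not figures from the glossary.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the work (0.0 to 1.0) that can run in parallel.
    workers: number of cores/CPUs sharing the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 90% of the work parallelisable, 16 workers give about a 6.4x speedup,
# and no number of workers can exceed 1 / 0.1 = 10x.
speedup_16 = amdahl_speedup(0.9, 16)
```

Only a fully parallel workload (`parallel_fraction=1.0`) scales in proportion to the number of workers, which matches the entry's "potentially proportionate" wording.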

Term: Rackmounts
Definition: Independent, fully self-contained servers. These servers are generally larger than blades and can have more RAM, CPU, disk, etc. Whilst they are typically housed in a rack, rackmounts provide flexibility over blades in that a limited number (up to three practically) can be stacked independently (with switching) to form an appliance without the need for rack infrastructure.
Significance: Kognitio appliances can be based on rackmounts as well as blade servers. For certain applications, rackmounts can provide cost advantages over blade servers (e.g. small appliances up to 768 GB RAM).

Term: RAM Only Temporary Tables (ROTT)
Definition: A table in Kognitio RAM with no associated storage (protection) in the persistence layer. This means that, whilst the structure is persistent, the data is ephemeral.
Significance: ROTTs are used for non-persistent workloads. For example, they provide the highest potential load speeds for data that needs to be processed before it is persisted. The alternative, tables and table images, would involve writing to disk with the resultant delay. Failure of an appliance will result in loss of data held in ROTTs.

Term: Referential Integrity
Definition: This is the process of ensuring that the data entered in a column is valid. For example, in a relational table, a column may be specified as a foreign key (i.e. the data must exist in another table), in which case, at data load time, this constraint will be checked before the data is entered. Failure of the constraint will result in the data not being entered.
Significance: Kognitio is fully ACID compliant and supports referential integrity. However, tables need to be in RAM to perform this task and the process has a severe impact on load performance since it results in a full table scan for each referential integrity check. Careful consideration needs to be given to application design implications.

Term: Scale-up
Definition: To increase the size of a server through the addition of new resources (CPU/memory).
Significance: Many databases can only utilise single servers, so the ability to incorporate greater resources is necessary for them to address larger data sets. However, there are cost implications to scaling-up (e.g. larger memory DIMMs tend to be considerably more expensive) and limitations to the data set sizes that can be addressed. Kognitio can fully utilise the resources available in a scaled-up environment.

Term: Scale-out
Definition: To increase the size of an appliance through the addition of more nodes.
Significance: Whilst utilising scaling-up to increase data sizes addressable by databases is common, it is less common to be able to do this by scaling-out. Kognitio addresses larger data sizes through scaling-out. It can often involve less capital outlay to have several smaller nodes than one very large server, and the size limitations of a single server are removed.

Term: S3 (Simple Storage Service)
Definition: A cost effective, secure and highly available file storage area available on AWS cloud infrastructure.
Significance: Provides the facility to stage data files ready to load into a Kognitio Analytical Platform. Also provides an environment to store readily available backups and associated files. Kognitio, in v8, has a connector that can map external tables onto S3 and load the entire file into RAM.

Term: SQL (Structured Query Language)
Definition: SQL is a language designed to manage and query data held in a relational data store.
Significance: SQL is the standard used for querying the Kognitio Analytical Platform.
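The Referential Integrity entry above describes a foreign key being checked at load time, with failing rows not being entered. The sketch below shows that behaviour using Python's built-in `sqlite3` module as a stand-in for a relational store; it is not Kognitio, and the `products` and `orders` tables and the `load_orders` helper are hypothetical. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3


def load_orders(conn, orders):
    """Insert rows one at a time; return the rows rejected by the constraint."""
    rejected = []
    for order in orders:
        try:
            with conn:  # one transaction per row: commit or roll back
                conn.execute(
                    "INSERT INTO orders (order_id, product_id) VALUES (?, ?)",
                    order,
                )
        except sqlite3.IntegrityError:
            # product_id does not exist in products, so the row is not entered.
            rejected.append(order)
    return rejected


def demo_foreign_key():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "product_id INTEGER NOT NULL REFERENCES products(product_id))"
    )
    conn.executemany("INSERT INTO products VALUES (?)", [(1,), (2,)])
    conn.commit()
    rejected = load_orders(conn, [(10, 1), (11, 99), (12, 2)])
    loaded = [
        row[0]
        for row in conn.execute("SELECT order_id FROM orders ORDER BY order_id")
    ]
    return rejected, loaded
```

Each inserted row triggers a lookup against the parent table, which is the per-row cost the entry warns about when it says the checks have a severe impact on load performance.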

Term: Switch
Definition: A network switch enables the linking of multiple network devices.
Significance: Kognitio appliances require the cooperative processing of multiple nodes. As such, switches are required to facilitate the flow of data/message passing between nodes. Note: for appliances involving two nodes, no switching is required as the nodes are linked peer to peer.

Term: Table Image
Definition: A Kognitio table that is simultaneously available in RAM and on disk. The table may be completely or partially (only selected columns or rows) represented in RAM.
Significance: Table images enable both performant queries and persistence.

Term: Time to Insight
Definition: The time taken from the point at which the data of interest is generated in an operational system to the point at which it has been analysed. This involves several aspects: volume of data, velocity of data, network speed, need to move data, load speed and query speed.
Significance: Kognitio has the ability to ingest and query data very quickly (not just query). As such, Kognitio’s time to insight is considerably lower than many other competitive products such as those which rely on accelerative structures (OLAP, indexes, columnar) to provide acceptable query performance (as this impacts on load speed).

Term: Transactional Workloads
Definition: A transactional, as opposed to analytical, workload is one that involves a large number (compared to analytical workloads) of small processes that may involve locating, inserting, updating or deleting rather than querying data. Transaction speed and referential integrity are critical to this workload.
Significance: Whilst Kognitio can support transactional workloads, it has been designed to manage analytical workloads. As such, for transactional environments, it is highly likely that OLTP databases will more appropriately fulfil the requirement.

Term: View Image
Definition: An in-memory instantiation (copy of results) of a view in Kognitio. At the point of instantiation, processing (such as joins, group-bys, etc.) associated with the view is undertaken and the results physically stored in RAM.
Significance: View images considerably enhance the performance of queries where the views are used repeatedly as the processing in the view only needs to be carried out once. Allows different representations of common underlying data.

Term: XML/A
Definition: XML for Analysis is a published standard mechanism for connecting to analytical data sources such as OLAP (via the MDX language) and data mining. XML/A is a three tier architecture (client, mid-tier and server) enabling the caching of results to be incorporated, which can considerably increase the speed of satisfying common user community queries.
Significance: Kognitio, via its Simba Technologies developed MDX provider, can support XML/A connectivity to OLAP objects. However, note that not all OLAP clients may necessarily be supported owing to the variability with which tools have incorporated the standard. Kognitio’s implementation incorporates a caching tier that can enhance query and concurrency performance.
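The View Image entry above describes running a view's processing once and storing the results. The sketch below imitates that idea with Python's built-in `sqlite3` module by copying a view's result set into a table. This is an analogy only: it does not use Kognitio syntax, the `sales` table and object names are hypothetical, and the glossary does not say how or when a real view image is refreshed.

```python
import sqlite3


def demo_view_image():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("apple", 3), ("cola", 5), ("apple", 2)],
    )
    # An ordinary view: the GROUP BY is re-executed on every query.
    conn.execute(
        "CREATE VIEW units_by_product AS "
        "SELECT product, SUM(units) AS units FROM sales GROUP BY product"
    )
    # The stand-in "image": run the view once and store the result rows.
    conn.execute(
        "CREATE TABLE units_by_product_image AS SELECT * FROM units_by_product"
    )
    # A later insert shows the stored copy is a snapshot of the view's results.
    conn.execute("INSERT INTO sales VALUES ('cola', 1)")
    live = dict(conn.execute("SELECT product, units FROM units_by_product"))
    image = dict(
        conn.execute("SELECT product, units FROM units_by_product_image")
    )
    return live, image
```

Queries against the stored copy skip the aggregation entirely, which is where the repeated-use benefit in the entry comes from; the trade-off in this sketch is that the copy no longer matches the view after the base table changes.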