Column-oriented databases like Infobright Community Edition are well-suited for data warehousing due to their high data compression rates and efficient handling of analytic queries. Infobright uses data packs, knowledge nodes, and an optimizer to retrieve only necessary column data without decompressing entire files. It achieves industry-leading compression of 10-40x by optimizing algorithms for each data type and stores metadata to resolve complex queries without traditional row-based indexing. By integrating with MySQL, Infobright leverages existing connectivity and provides a low-cost option for data warehousing and business intelligence.
Slides from a talk given at the database conference in Charlottesville, Virginia in November, 2008
Authors:
Dominik Ślezak, Chief Scientist, Infobright Founder
Victoria Eastwood, VP of Engineering
Susan Bantin, Sales Engineering
Alex Esterkin, Chief Architect
2. Data warehousing consists of two aspects:
• A data warehouse or data store.
• Application software for enterprise reporting, analysis, data mining, and ETL (extract-transform-load) capabilities for business intelligence.
There are several open-source database management systems, but only a few of them are suitable for data warehousing. The architecture best suited to data warehousing is the column-oriented architecture: a column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row.
3. Description of Column-Oriented Architecture
A database program must present its data as two-dimensional tables of columns and rows, but store it as one-dimensional strings. For example, a database might have this table:

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on:
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on:
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
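
The contrast between the two layouts is easy to reproduce. Below is a minimal Python sketch, purely illustrative (not how any real engine stores data on disk), that serializes the example table both ways:

    # Contrast row-oriented and column-oriented serialization of the
    # example table above. Illustration only, not engine code.
    rows = [
        (1, "Smith", "Joe", 40000),
        (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000),
    ]

    # Row-oriented: all values of one record together, record after record.
    row_layout = ";".join(",".join(str(v) for v in r) for r in rows) + ";"

    # Column-oriented: all values of one column together, column after column.
    columns = zip(*rows)  # transpose the rows into columns
    col_layout = ";".join(",".join(str(v) for v in c) for c in columns) + ";"

    print(row_layout)  # 1,Smith,Joe,40000;2,Jones,Mary,50000;...
    print(col_layout)  # 1,2,3;Smith,Jones,Johnson;...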
4. Partitioning, indexing, caching, views, OLAP cubes, and transactional mechanisms such as write-ahead logging or multiversion concurrency control all dramatically affect the physical organization. Online transaction processing (OLTP)-focused RDBMS systems are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.
Comparison between row-oriented and column-oriented systems:
1. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
2. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns of the rows.
3. Row-oriented systems are more efficient when many columns of a single row are required at the same time, and when row size is relatively small, as the entire row can be retrieved with a single disk seek.
4. Row-oriented systems are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
Compression
Column data is of uniform type; therefore, there are opportunities for storage-size optimizations in column-oriented data that are not available in row-oriented data.
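
As a concrete illustration of why uniform-type columns compress well, here is a minimal run-length encoding sketch in Python. RLE is only a stand-in chosen for brevity; the compression algorithms Infobright actually applies are its own:

    # Run-length encoding: uniform-type column data often has long runs
    # or low variability, which even simple schemes exploit.
    from itertools import groupby

    def rle_encode(values):
        # Collapse each run of equal values into a (value, count) pair.
        return [(v, len(list(g))) for v, g in groupby(values)]

    status = ["active"] * 5000 + ["closed"] * 3000 + ["active"] * 2000
    print(rle_encode(status))
    # [('active', 5000), ('closed', 3000), ('active', 2000)] -- 10,000
    # values collapsed into 3 pairs; mixed-type rows rarely compress so well.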
5. Current examples of column-oriented DBMSs:
* Calpont's InfiniDB Community Edition, MySQL front end, GPLv2
* C-Store (no new release since October 2006)
* GenoByte, a column-based storage system and API for handling genotype data
* Lemur Bitmap Index C++ Library (GPL)
* FastBit
* Infobright Community Edition (regular updates)
* LucidDB and Eigenbase
* MonetDB (academic project)
* Metakit
* The S programming language and GNU R, which incorporate column-oriented data structures for statistical analyses
Only InfiniDB and Infobright are based on MySQL and therefore suitable for people with MySQL skills; the others are academic projects or only libraries. Infobright was a leading MySQL partner in 2009 and has the larger community.
6. Infobright Community Edition Key Benefits
• Ideal for data volumes of 500 GB to 30 TB.
• Market-leading data compression (from 10:1 to over 40:1), which drastically reduces I/O (improving query performance) and requires significantly less storage than alternative solutions.
• No licensing fees.
• Fast response times for complex analytic queries.
• Query and load performance remains constant as the size of the database grows.
• No requirement for specific schemas, e.g. a star schema.
• No requirement for materialized views, complex data-partitioning strategies, or indexing.
• Simple to implement and manage, requiring little administration.
• Reduces data warehouse capital and operational expenses by cutting the number of servers, the amount of storage needed and its associated maintenance costs, along with a significant reduction in administrative costs.
• Runs on low-cost, off-the-shelf hardware.
• Compatible with major business intelligence tools such as Pentaho, JasperSoft, Cognos, Business Objects, and others.
7. Infobright Architecture
1. Column Orientation
2. Data Packs and Data Pack Nodes
3. Knowledge Nodes and the Knowledge Grid
4. The Optimizer
5. Data Compression

1. Column Orientation
Infobright at its core is a highly compressed column-oriented data store: instead of the data being stored row by row, it is stored column by column. Column orientation has many advantages, including more efficient data compression, because each column stores a single data type (as opposed to rows, which typically contain several data types); compression can therefore be optimized for each particular data type, significantly reducing disk I/O. Most analytic queries involve only a subset of the columns of the tables, so a column-oriented database focuses on retrieving only the data that is required.
8. 2. Data Organization and the Knowledge Grid
Infobright organizes the data into three layers:
• Data Packs
The data within the columns is stored in groupings of 65,536 items called Data Packs. The use of Data Packs improves data compression, since they are smaller subsets of the column data (hence less variability) and the compression algorithm can be applied based on data type.
• Data Pack Nodes (DPNs)
Data Pack Nodes contain a set of statistics about the data that is stored and compressed in each of the Data Packs. There is always a 1-to-1 relationship between Data Packs and DPNs. DPNs always exist, so Infobright has some information about all the data in the database, unlike traditional databases, where indexes are created for only a subset of columns.
• Knowledge Nodes (KNs)
These are a further set of metadata related to Data Packs or column relationships. They can be introspective, describing ranges of value occurrences within the data, or extrospective, describing how the data relates to other data in the database. Most KNs are created at load time, but others are created in response to queries in order to optimize performance. This is a dynamic process, so a particular Knowledge Node may or may not exist at a given point in time.
The DPNs and KNs form the Knowledge Grid. Unlike traditional database indexes, they are not manually created and require no ongoing care and feeding; instead, they are created and managed automatically by the system. In essence, they form a high-level view of the entire content of the database.
9. 3. The Infobright Optimizer
The Optimizer is the highest level of intelligence in the architecture. It uses the Knowledge Grid to determine the minimum set of Data Packs that need to be decompressed in order to satisfy a given query in the fastest possible time.
10. 4. Resolving Complex Analytic Queries without Indexes
The Infobright data warehouse resolves complex analytic queries without the need for traditional indexes.
Data Packs
As noted, Data Packs consist of groupings of 65,536 items within a given column. For example, for a table T with columns A and B and 300,000 records, Infobright would have the following Packs:
Pack A1: values of A for rows 1-65,536
Pack A2: values of A for rows 65,537-131,072
Pack A3: values of A for rows 131,073-196,608
Pack A4: values of A for rows 196,609-262,144
Pack A5: values of A for rows 262,145-300,000
Pack B1: values of B for rows 1-65,536
Pack B2: values of B for rows 65,537-131,072
Pack B3: values of B for rows 131,073-196,608
Pack B4: values of B for rows 196,609-262,144
Pack B5: values of B for rows 262,145-300,000
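
The pack boundaries follow directly from the 65,536-item grouping. A minimal sketch (hypothetical helper, not Infobright code) that reproduces the list above:

    # Compute the row ranges covered by each Data Pack of a column.
    PACK_SIZE = 65_536

    def pack_ranges(n_rows, pack_size=PACK_SIZE):
        # Yield (pack_number, first_row, last_row), with 1-based row numbers.
        for start in range(0, n_rows, pack_size):
            yield start // pack_size + 1, start + 1, min(start + pack_size, n_rows)

    for pack_no, first, last in pack_ranges(300_000):
        print(f"Pack {pack_no}: rows {first:,}-{last:,}")
    # Pack 1: rows 1-65,536 ... Pack 5: rows 262,145-300,000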
11. The Knowledge Grid
The Infobright Knowledge Grid includes Data Pack Nodes and Knowledge Nodes. For example, for the above table T, assume that both A and B store numeric values. The DPN for Pack A1 would then read as follows: for the first 65,536 rows in T, the minimum value of A is 0, the maximum is 5, and the sum of the values of A is 100,000 (and there are no null values).
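
A DPN is just a handful of statistics per pack. A minimal sketch of such a structure, whose field set mirrors the slide (min, max, sum, null count); Infobright's internal format is not public:

    from dataclasses import dataclass

    @dataclass
    class DataPackNode:
        min_val: int
        max_val: int
        total: int    # sum of the non-null values in the pack
        nulls: int    # how many values in the pack are null

    def build_dpn(pack_values):
        # Compute the statistics once, at load time, per 65,536-value pack.
        non_null = [v for v in pack_values if v is not None]
        return DataPackNode(min(non_null), max(non_null), sum(non_null),
                            len(pack_values) - len(non_null))

    print(build_dpn([0, 5, 3, None, 2]))
    # DataPackNode(min_val=0, max_val=5, total=10, nulls=1)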
12. DPNs are accessible without the need to decompress the corresponding Data Packs.
Knowledge Nodes (KNs) were developed to deal efficiently with complex, multi-table queries (joins, sub-queries, etc.). To process multi-table join queries and sub-queries, the Knowledge Grid uses Pack-to-Pack KNs that indicate which pairs of Data Packs from different tables should actually be considered while joining the tables.
KNs can be compared to the indexes used by traditional databases; however, KNs work on Packs instead of rows. KNs are therefore 65,536 times smaller than indexes (or even 65,536 × 65,536 times smaller for Pack-to-Pack Nodes, because of the size decrease on each of the two tables involved). In general the overhead is around 1% of the data, compared to classic indexes, which can be 20-50% of the size of the data.
Knowledge Nodes are created on data load and may also be created during queries. They are automatically created and maintained by the Knowledge Grid Manager based on the column type and definition, so no intervention by a DBA is necessary.
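
The slides do not spell out how a Pack-to-Pack KN is built, but one natural reading is a set of pack pairs whose value ranges can possibly match. The following sketch illustrates that assumed idea using DPN min/max ranges; it is not Infobright's actual mechanism:

    # Hypothetical Pack-to-Pack Knowledge Node for a join between two
    # tables: keep only the (left_pack, right_pack) pairs that can match.
    def ranges_overlap(a, b):
        return a["min"] <= b["max"] and b["min"] <= a["max"]

    t1_dpns = {1: {"min": 0, "max": 9}, 2: {"min": 100, "max": 250}}
    t2_dpns = {1: {"min": 5, "max": 40}, 2: {"min": 300, "max": 900}}

    pack_to_pack = {
        (i, j)
        for i, a in t1_dpns.items()
        for j, b in t2_dpns.items()
        if ranges_overlap(a, b)
    }
    print(pack_to_pack)  # {(1, 1)} -- only one pack pair needs joining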
13. The Infobright Optimizer
For every query coming into the Optimizer, it applies DPNs and KNs to split the Data Packs into the three following categories:
• Relevant Packs - each element (the record's value for the given column) is identified, based on DPNs and KNs, as applicable to the given query.
• Irrelevant Packs - based on DPNs and KNs, the Pack holds no relevant values.
• Suspect Packs - some elements may be relevant, but based on DPNs and KNs there is no way to claim that the Pack is either fully relevant or fully irrelevant.
While querying, Infobright does not need to decompress either Relevant or Irrelevant Data Packs. Irrelevant Packs are simply not taken into account at all. In the case of Relevant Packs, Infobright knows that all elements are relevant, and the required answer is obtainable. The walkthrough below, and the sketch that follows it, shows this for a concrete query.
Query: SELECT SUM(B) FROM T WHERE A > 6;
14. Packs A1, A2, and A4 are Irrelevant: none of their rows can satisfy A > 6, because all these packs have maximum values below 6. Consequently, Packs B1, B2, and B4 will not be analyzed while calculating SUM(B); they are Irrelevant too.
Pack A3 is Relevant: all the rows numbered 131,073-196,608 satisfy A > 6, which means Pack B3 is Relevant too. The sum of the values of B within Pack B3 is one component of the final answer. Based on B3's DPN, Infobright knows that this sum equals 100, and that is everything Infobright needs to know about this portion of the data.
Pack A5 is Suspect: some of its rows satisfy A > 6, but it is not known which ones. As a consequence, Pack B5 is Suspect too. Infobright must decompress both A5 and B5 to find which of rows 262,145-300,000 satisfy A > 6 and sum the values of B for precisely those rows. The result is added to the value of 100 previously obtained for Pack B3 to form the final answer to the query.
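
A minimal Python sketch of that logic, mirroring the Relevant/Irrelevant/Suspect classification above (illustrative only; the real optimizer is far more elaborate):

    # SUM(B) WHERE A > threshold, resolved pack by pack from DPN stats.
    def sum_b_where_a_gt(a_dpns, b_dpns, a_packs, b_packs, threshold):
        total = 0
        for a_dpn, b_dpn, a_vals, b_vals in zip(a_dpns, b_dpns, a_packs, b_packs):
            if a_dpn["max"] <= threshold:
                continue                  # Irrelevant: never decompressed
            if a_dpn["min"] > threshold:
                total += b_dpn["sum"]     # Relevant: answered from the DPN
            else:                         # Suspect: decompress and scan
                total += sum(b for a, b in zip(a_vals, b_vals) if a > threshold)
        return total

    # Tiny stand-ins for 65,536-value packs: pack 1 is Irrelevant (max 5),
    # pack 2 is Relevant (min 7 > 6), pack 3 is Suspect (range straddles 6).
    a_dpns = [{"min": 0, "max": 5}, {"min": 7, "max": 9}, {"min": 4, "max": 8}]
    b_dpns = [{"sum": 10}, {"sum": 100}, {"sum": 30}]
    a_packs = [[0, 5], [7, 9], [4, 8]]
    b_packs = [[4, 6], [40, 60], [10, 20]]
    print(sum_b_where_a_gt(a_dpns, b_dpns, a_packs, b_packs, 6))  # 100 + 20 = 120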
15. 5. Data Loading and Compression
During load, the 65,536 values of a given column's Data Pack are treated as a sequence with zero or more null values occurring anywhere in the sequence. Information about the null positions is stored separately (within the Null Mask), and the remaining stream of non-null values is then compressed, taking full advantage of regularities inside the data.
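
A minimal sketch of that load-time split, separating a null mask from the non-null stream before compressing it. zlib and pickle are stand-ins chosen for brevity; Infobright's own algorithms are different:

    import pickle
    import zlib

    values = [7, None, 7, 7, None, 3, 3, 3]

    null_mask = [v is None for v in values]           # stored separately
    non_nulls = [v for v in values if v is not None]  # the stream to compress

    compressed = zlib.compress(pickle.dumps(non_nulls))

    # Decompression restores the stream; the mask puts nulls back in place.
    restored = iter(pickle.loads(zlib.decompress(compressed)))
    round_trip = [None if is_null else next(restored) for is_null in null_mask]
    assert round_trip == values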
16. With Infobright, a 1 TB database becomes a 100 GB database. Since the data is much smaller, the effective disk transfer rate is improved (even allowing for the overhead of decompression).
One of Infobright's benefits is its industry-leading compression. Unlike traditional row-based data warehouses, data is stored by column, allowing compression algorithms to be finely tuned to the column data type. Moreover, for each column the data is split into Data Packs, each storing up to 65,536 values of a given column. Infobright then applies a set of patent-pending compression algorithms that automatically self-adjust various parameters for each Data Pack.
Infobright achieves an average compression ratio of 10:1; for example, 10 TB of raw data can be stored in about 1 TB of space on average (including the overhead associated with Data Pack Nodes and the Knowledge Grid). The compression ratio varies depending on data types and content; some data turns out to be more repetitive than others, so compression ratios can be as high as 40:1.
17. 6. How Infobright Leverages MySQL
In the data warehouse marketplace, a database must integrate with a variety of drivers and tools. By integrating with MySQL, Infobright leverages the extensive driver connectivity provided by the MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.).
MySQL also provides cataloging functions, such as table definitions, views, users, and permissions; these are stored in a MyISAM database.
Although Infobright implements the storage engine interface that MySQL provides, the Infobright technology has its own advanced Optimizer. This is required because of the column orientation, but also because of the unique Knowledge Node technology described above.
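
Because Infobright presents itself through MySQL, any standard MySQL client or connector can query it. A minimal sketch using the mysql-connector-python driver; the host, port, credentials, and table name below are hypothetical placeholders, not documented Infobright defaults:

    # Query an Infobright instance through an ordinary MySQL driver.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost",   # placeholder connection details
        port=3306,          # use whatever port your Infobright server listens on
        user="root",
        password="",
        database="warehouse",
    )
    cur = conn.cursor()
    cur.execute("SELECT SUM(B) FROM T WHERE A > 6")  # the query from slide 13
    print(cur.fetchone()[0])
    cur.close()
    conn.close()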
18. Conclusion
Infobright has developed an architecture designed for analytic data warehousing that also requires much less work from IT organizations, delivering faster time-to-market for analytic applications. The open-source Infobright Community Edition makes it affordable for companies of all sizes. Almost half of IT directors are prevented from rolling out the BI initiatives they desire because of expensive licensing and specialized hardware costs; Infobright can help by supplying the low TCO that IT executives are looking for.