BIG DATA-MART
using NoSQL key-value stores
Chethiya Galkaduwa | Praveen Bhawantha
Introduction
Data gathering
Companies collect data from various channels for strategic goals and competitive
advantage
Data warehousing challenges
Big data analysis requires alternative database management systems
NoSQL databases
Offer advantages like horizontal scaling, flexible data types, and fast data access
Data mart
Sub-element of a data warehouse, used by specific departments
Methodology
Multidimensional Schema
The multidimensional schema is the core of data mart
architecture.
Formalization
Dimensions store attributes that describe the object in
the fact.
They can be flat or snowflaked, and are composed of
hierarchies and descriptive textual values.
LineOrder serves as the fact, with measures.
Dimensions include Customer, Product, and Order
Date, enabling data aggregation through hierarchies.
The data organization resembles a logical tree structure
with parent-child relationships.
● LineOrder : Fact with measures (Quantity,
Tax, Discount)
● Dimensions : Customer, Product, Order Date
● Hierarchies define data aggregation
● Example : Order Date dimension hierarchy
aggregates data from week to month to
year
● Data organization resembles a logical tree
structure
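The roll-up along the Order Date hierarchy can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the `line_orders` values and the `roll_up` helper are hypothetical, and ISO weeks are used as an assumed definition of the week level.

```python
from collections import defaultdict
from datetime import date

# Hypothetical LineOrder facts as (order_date, quantity); values are made up.
line_orders = [
    (date(2020, 1, 6), 5),
    (date(2020, 1, 20), 3),
    (date(2020, 2, 3), 7),
    (date(2021, 2, 1), 4),
]

def roll_up(facts, level):
    """Sum the Quantity measure at one level of the Order Date hierarchy."""
    key_of = {
        "week": lambda d: (d.isocalendar()[0], d.isocalendar()[1]),  # ISO year, ISO week
        "month": lambda d: (d.year, d.month),
        "year": lambda d: (d.year,),
    }[level]
    totals = defaultdict(int)
    for order_date, quantity in facts:
        totals[key_of(order_date)] += quantity
    return dict(totals)

by_week = roll_up(line_orders, "week")
by_month = roll_up(line_orders, "month")
by_year = roll_up(line_orders, "year")
```

Each level groups the same facts more coarsely, so the grand total is identical at every level.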
Multidimensional Conceptual Schema
Formalization
Key-Value store
A key-value store (KV) is a fundamental concept in NoSQL databases.
It is represented by a collection of ordered pairs (K × V) containing unique keys and corresponding values.
The key uniquely references a single value and can be of any data structure, with possible restrictions imposed by
specific database management systems.
NoSQL databases place no limitations on the structure and data type of the stored values, allowing flexibility for
JSON documents, text, or embedded key-value pairs. Key/value pairs are denoted using "{}" with nested records,
and the symbol "⇒" maps a key to its associated value. The value is enclosed in "[]".
An example of a key-value pair schema is illustrated, where a "Person" is represented with a nested record named
"Address".
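As a rough sketch of the same idea in code, a Python dict can stand in for the key-value store. The `put` helper, the key format "Person/1", and all field values are assumptions made for illustration; this is not a real database client.

```python
# A plain dict stands in for the key-value store (assumption, not a real client).
store = {}

def put(key, value):
    """Insert a pair, enforcing that each key references a single value."""
    if key in store:
        raise KeyError(f"duplicate key: {key}")
    store[key] = value

put("Person/1", {
    "name": "Alice",          # hypothetical values
    "age": 30,
    "Address": {              # nested record embedded in the value
        "city": "Colombo",
        "street": "Galle Road",
    },
})

person = store["Person/1"]
city = person["Address"]["city"]
```

The value is free-form, so the nested "Address" record needs no separate schema or table.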
Formalization
Approach Overview
Flat Logical Approach
Stores fact and associated
dimensions in a single table with
embedded records, optimizing
queries with fewer joins at the
expense of storage space.
Hierarchical Logical
Approach
Separates fact and dimensions into
separate tables (fact table and child
tables).
Exploded Hierarchical Logical
Approach
Is an extension of HLA, exploding
dimension tables to represent levels
in the dimensional hierarchy.
To instantiate a big data mart on a key-value NoSQL store, three approaches are considered:
● FLA (Flat Logical Approach)
● HLA (Hierarchical Logical Approach)
● EHLA (Exploded Hierarchical Logical Approach).
These models provide a SQL-like table structure on key-value stores for SQL-type querying.
NoSQL databases typically lack join support across tables, but Oracle NoSQL Database allows joins within the same table hierarchy.
Transformation rules help set up a big data warehouse on a NoSQL key-value database.
Three logical models are considered:
● FLM (Flat Logical Model) -> represents full data
denormalization
● SHLM (Star Hierarchical Logical Model) -> follows the star
hierarchy
● SnHLM (Snow Hierarchical Logical Model) -> adopts the snow hierarchy
FLM (Flat Logical Model)
● Consists of a single two-dimensional array of data elements.
● Fact and associated dimensions are transformed into a key-
value data table.
● Key-value table structure: T(Id_T, Att_T, R_T).
● Id_T represents the fact key, Att_T includes simple attributes, and
R_T stores nested records.
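A minimal sketch of the FLM structure T(Id_T, Att_T, R_T), assuming a fact arrives as a dict with an "id" field and its dimension records are already resolved. The `to_flm_row` helper and all sample values are hypothetical and are not taken from the paper.

```python
def to_flm_row(fact, dimensions):
    """Build one FLM row as a (key, value) pair following T(Id_T, Att_T, R_T)."""
    key = fact["id"]                                        # Id_T: the fact key
    simple = {k: v for k, v in fact.items() if k != "id"}   # Att_T: measures and foreign keys
    return key, {"Att": simple, "R": dimensions}            # R_T: nested dimension records

# Hypothetical LineOrder fact and its dimension records.
fact = {"id": 1, "Quantity": 5, "Tax": 0.2, "Discount": 0.1, "customer_id": 7}
dimensions = {
    "Customer": {"id": 7, "name": "C-07", "nation": "Kenya"},
    "Product": {"id": 3, "name": "P-03", "category": "Tools"},
}

table = dict([to_flm_row(fact, dimensions)])
nation = table[1]["R"]["Customer"]["nation"]   # dimension data read without a join
```

Because the dimension records travel with the fact, one key lookup returns everything a query needs for that row.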
Transformation Rules
Example:
SHLM (Star Hierarchical Logical Model)
● Organizes fact and dimensions into a tree-like structure using parent/child relationships.
● Mapping to key-value pairs: Fact key with measures and dimension tables with attributes.
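The parent/child mapping can be sketched with tuple keys, where a child key extends the key of its parent fact. The key layout, the `children` helper, and the sample records are assumptions for illustration; Oracle NoSQL Database's own child-table syntax is not shown here.

```python
# Hypothetical SHLM layout: a child key is the parent fact key plus the dimension name.
store = {
    ("LineOrder", 1): {"Quantity": 5, "Tax": 0.2, "Discount": 0.1},
    ("LineOrder", 1, "Supplier"): {"name": "S-01", "nation": "France"},
    ("LineOrder", 1, "Customer"): {"name": "C-07", "nation": "Kenya"},
    ("LineOrder", 2): {"Quantity": 3, "Tax": 0.1, "Discount": 0.0},
    ("LineOrder", 2, "Supplier"): {"name": "S-02", "nation": "Kenya"},
}

def children(kv, parent_key):
    """Return the direct child records of parent_key, indexed by the last key part."""
    n = len(parent_key)
    return {
        key[n]: value
        for key, value in kv.items()
        if len(key) == n + 1 and key[:n] == parent_key
    }

dims_of_order_1 = children(store, ("LineOrder", 1))
```

Reading a fact together with its dimensions is a lookup on the parent key followed by a scan of keys that share its prefix.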
SnHLM (Snow Hierarchical Logical Model)
● Extends the SHLM by connecting dimensions to multiple other dimensions.
● Replicates dimension child tables for each parent fact table.
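To make the replication concrete, the sketch below extends a Supplier child with a Nation child under every fact and counts the resulting copies. The `to_snhlm` helper, the key layout, and the data are hypothetical and only illustrate where the redundancy comes from.

```python
# Hypothetical source data: three facts, two suppliers, one nation.
facts = [
    {"id": 1, "Quantity": 5, "supplier": "S-01"},
    {"id": 2, "Quantity": 3, "supplier": "S-01"},
    {"id": 3, "Quantity": 7, "supplier": "S-02"},
]
suppliers = {
    "S-01": {"name": "S-01", "nation": "FR"},
    "S-02": {"name": "S-02", "nation": "FR"},
}
nations = {"FR": {"name": "France", "region": "Europe"}}

def to_snhlm(facts, suppliers, nations):
    """Map facts to a snowflaked hierarchy: LineOrder -> Supplier -> Nation."""
    kv = {}
    for fact in facts:
        fact_key = ("LineOrder", fact["id"])
        supplier = suppliers[fact["supplier"]]
        kv[fact_key] = {"Quantity": fact["Quantity"]}
        kv[fact_key + ("Supplier",)] = {"name": supplier["name"]}
        # The Nation record is copied under every fact that reaches it.
        kv[fact_key + ("Supplier", "Nation")] = dict(nations[supplier["nation"]])
    return kv

kv = to_snhlm(facts, suppliers, nations)
nation_copies = sum(1 for key in kv if key[-1] == "Nation")
```

A single source Nation record ends up stored once per fact, which is the kind of duplication that inflates disk usage in this model.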
These logical models provide different structures for implementing a data mart using key-value NoSQL databases.
They offer flexibility and optimization opportunities based on the specific requirements and relationships of the
data.
Transformation Rules
Example:
Example:
Implementation
Experiment Overview
● The experiments aim to validate the proposed approaches for implementing a big data mart using key-value
NoSQL databases.
● Two experiments conducted: one to measure read request latency and another to evaluate storage space per
logical model.
● Analytical queries are used to assess the models' performance and aggregation capabilities involving
different dimensions.
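Read latency of the kind measured here can be approximated with a small timing harness. This is a generic sketch, not the paper's measurement setup: `measure_latency` and `sample_query` are hypothetical names, and the in-memory query merely stands in for a real database call.

```python
import statistics
import time

def measure_latency(query, repeats=5):
    """Run query several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload: summing a measure over in-memory rows.
rows = [{"Quantity": i % 10} for i in range(10_000)]

def sample_query():
    return sum(row["Quantity"] for row in rows)

latency = measure_latency(sample_query)
```

Taking the median over several runs reduces the influence of one-off delays such as cache warm-up.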
Implementation
Oracle NoSQL Database
Overview of Oracle NoSQL Database:
● NoSQL-type key-value database
● Horizontal scalability across multiple shards
● Supports JSON, SQL-like table, and key-value data types
● Features: parent-child join, aggregation functions, parallel scans, column indexing
● Available editions: Basic Edition (BE), Enterprise Edition (EE), Community Edition (CE)
The experiments below use the Community Edition (Apache 2.0 license).
Implementation
Environment Setting
Data Generation:
● Based on the TPC-H benchmark for decision support systems
● Custom queries for basic OLAP operations
● Data generated in JSON format using KoalaBench project
Software Setting:
● Docker engine used for deploying Oracle NoSQL database
● Two setups: single node and 3x1 cluster with Docker Swarm
● Host machine: Intel Xeon W3530 with 8 GB RAM
Implementation
Experiments
Experiment 1 : Query Execution Time
● This experiment studies query execution
time on a 3-node cluster with scale factors sf=1
and sf=10, and evaluates database performance
per model using HiveQL.
● Observation: FLM performs better due to no
need for joins in aggregations.
● SnHLM has lower query performance compared
to SHLM, especially at larger scale factors.
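The observation about joins can be illustrated with the same aggregation written against two layouts. This is a toy sketch in Python rather than HiveQL, with hypothetical data and function names; it shows the extra lookup per fact and makes no claim about actual timings.

```python
from collections import defaultdict

# Hypothetical data: the same three facts stored under FLM and under SHLM.
flm = {
    1: {"Quantity": 5, "Supplier": {"nation": "France"}},
    2: {"Quantity": 3, "Supplier": {"nation": "France"}},
    3: {"Quantity": 7, "Supplier": {"nation": "Kenya"}},
}
shlm = {
    ("LineOrder", 1): {"Quantity": 5},
    ("LineOrder", 1, "Supplier"): {"nation": "France"},
    ("LineOrder", 2): {"Quantity": 3},
    ("LineOrder", 2, "Supplier"): {"nation": "France"},
    ("LineOrder", 3): {"Quantity": 7},
    ("LineOrder", 3, "Supplier"): {"nation": "Kenya"},
}

def quantity_by_nation_flm(table):
    """Aggregate directly: every row already embeds its Supplier record."""
    totals = defaultdict(int)
    for row in table.values():
        totals[row["Supplier"]["nation"]] += row["Quantity"]
    return dict(totals)

def quantity_by_nation_shlm(kv):
    """Aggregate with a parent-child join: one extra lookup per fact."""
    totals = defaultdict(int)
    for key, fact in kv.items():
        if len(key) != 2:
            continue                          # skip child (dimension) records
        supplier = kv[key + ("Supplier",)]    # join the fact to its Supplier child
        totals[supplier["nation"]] += fact["Quantity"]
    return dict(totals)

flat_result = quantity_by_nation_flm(flm)
star_result = quantity_by_nation_shlm(shlm)
```

Both functions return the same totals; the hierarchical version simply does more work per fact, and a snowflaked layout would add one more lookup for each extra level.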
Experiment 2 : Storage Space
● This experiment studies the storage space allocated per
data model.
● Oracle NoSQL database is used for monitoring disk usage
and managing storage directory size.
● SnHLM occupies about 3 times more space than SHLM
and FLM due to high data redundancy.
● FLM offers optimal disk space but requires more
maintenance efforts with multiple attributes in one table.
Implementation
● Studied the implementation of a big data mart on a NoSQL key-value database
● Three logical models: FLM, SHLM, and SnHLM.
● Compared the models' strengths and weaknesses on two metrics: storage space and query performance.
● FLM demonstrated better performance according to the results.
References
● A. Khalil and M. Belaissaoui, “New approach for implementing big datamart using NoSQL key-value
stores,” in 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies
and Applications (CloudTech), Nov. 2020, pp. 1–6. doi: 10.1109/CloudTech49835.2020.9365897.
Conclusion
THANK YOU!

Editor's Notes

  • #1 Good afternoon, Madam! Today we are presenting our summary of the research paper "Big Data-Mart using NoSQL Key-Value Stores." I'm Praveen/Chethiya, and Chethiya/Praveen is my teammate for this presentation.
  • #2 So, let's begin by understanding the fundamental concepts and objectives behind this research. Shall we move on to the next slide?
  • #3 In this section, we'll delve deeper into the key aspects of our topic, "Big Data-Mart using NoSQL Key-Value Stores." Let's start by understanding the concept of a data mart. A data mart is a sub-element of a data warehouse that serves the specific needs of individual departments within an organization. It allows users to access and analyze relevant data tailored to their requirements. Moving on, we'll explore the data gathering process. Companies today collect vast amounts of data from various sources, ranging from customer interactions to operational metrics. This data gathering is essential to achieve strategic goals and maintain a competitive edge in the market. As we proceed, we'll discuss the challenges posed by traditional data warehousing approaches when dealing with big data. With the exponential growth of data, traditional relational databases face scalability and performance issues, leading to the need for alternative database management systems. Lastly, we'll introduce NoSQL databases and their significance in the big data landscape. NoSQL databases offer distinct advantages, such as horizontal scaling, support for flexible data types, and faster data access. These features make them a promising choice for building data marts that can efficiently handle large volumes of diverse data. Let's continue to the next slide, where we'll explore the different approaches to building a big data mart using NoSQL key-value stores.
  • #4 Now, let's delve into the heart of our presentation - the Methodology. We will take you through the systematic approach we followed to implement a big data mart using NoSQL key-value stores. This methodology has been designed to ensure efficiency and effectiveness in handling large-scale data, so you can expect some valuable insights into our approach. Let's get started!
  • #5 In the Formalization part of our methodology, we meet the crucial concept of the multidimensional schema (MS), which lies at the very heart of data mart architecture. It includes one or more fact data structures indexing a set of associated dimensions, and is widely used to build data warehouses and dimensional data marts. Formally, an MS is defined by the function MS and the formulas on the slide, which capture the structure of, and relationships between, facts and dimensions. (Explain the formulas.) Dimensions play a pivotal role, as they store attributes describing the objects in the fact; they can be flat or snowflaked, and are composed of hierarchies with descriptive textual values. (Explain the second formula.) In our case, LineOrder acts as the fact, featuring the measures Quantity, Tax, and Discount, while dimensions such as Customer, Product, and Order Date enable data aggregation through hierarchies, forming a logical tree structure with parent-child relationships. Let's dive deeper into this fundamental aspect of our approach and explore how it lays the groundwork for our big data mart implementation.
  • #6 Continuing with the Formalization, the next aspect we delve into is the Multidimensional Conceptual Schema. In this schema, we encounter the LineOrder as the fact, equipped with essential measures such as Quantity, Tax, and Discount. Complementing the fact, we have three dimensions: Customer, Product, and Order Date. These dimensions are vital in providing descriptive attributes that help analyze the data effectively. Hierarchies play a significant role as they define data aggregation. An example of this is seen in the Order Date dimension hierarchy, where data is aggregated from week to month to year, facilitating a deeper understanding of the information. This data organization exhibits a logical tree structure, forming parent-child relationships among members. As we progress with our Formalization, we gain deeper insights into the structures that enable us to build a robust big data mart using NoSQL key-value stores. (About the picture) In this slide, we have a visual representation that further explains the Multidimensional Conceptual Schema. The picture depicts the relationships between the fact, LineOrder, and its associated dimensions: Customer, Product, and Order Date. The fact, represented as a table, contains the measures Quantity, Tax, and Discount. The dimensions are also shown as tables, each with their respective attributes. The hierarchies within the Order Date dimension are illustrated, showcasing the data aggregation from the week level to the month level and finally to the year level. This graphical representation helps us visualize the logical tree structure formed by the parent-child relationships, providing a clear understanding of how the data is organized within the big data mart. It highlights the crucial role of dimensions and hierarchies in enabling effective data analysis and decision-making in the context of NoSQL key-value stores.
  • #7 In this slide, we delve into the concept of "Key-Value Store," which is a crucial aspect of NoSQL databases. A Key-Value Store (KV) is essentially a collection of ordered pairs (K × V) consisting of unique keys and their corresponding values. The key serves as a unique reference to a single value and can be of any data structure, although certain database management systems might impose restrictions on key data types. NoSQL databases offer great flexibility for the structure and data type of stored values. This allows for various types of data, including JSON documents, text, or even embedded key-value pairs. Key/value pairs are denoted using "{}" with nested records, and the symbol "⇒" is used to map a key to its associated value, which is enclosed in "[]". An example of a key-value pair schema is provided, illustrating how a "Person" is represented with a nested record named "Address." This slide highlights the importance of the Key-Value Store model in the context of implementing a big data mart using NoSQL databases, as it enables efficient and flexible data storage and retrieval for analytical purposes.
  • #8 In this slide, we present an overview of the three approaches we propose to instantiate a Big Data Mart using Key-Value NoSQL databases. These approaches are: FLA (Flat Logical Approach): This approach involves storing the fact and its associated dimensions in a single table, using an embedded record data structure. It promotes query optimization by reducing the number of joins, but it may lead to increased storage space due to data redundancy. HLA (Hierarchical Logical Approach): In this model, the fact and dimensions are stored in separate tables, similar to classical data warehouses. It allows for efficient data organization and querying. EHLA (Exploded Hierarchical Logical Approach): This approach is an extension of HLA, where each point of the star explodes into more points, representing levels in the dimensional hierarchy. Dimension tables are exploded, leading to increased data disk space at the cost of query performance and response time. It's important to note that while NoSQL databases typically lack support for joins among different tables, Oracle NoSQL Database allows joins within the same hierarchy. This slide provides an overview of the different approaches available for implementing a Big Data Mart using Key-Value NoSQL databases, highlighting their advantages and considerations for efficient data storage and querying.
  • #9 In the context of transformation rules for setting up a big data warehouse on a NoSQL key-value database, this study discusses three logical models: FLM, SHLM, and SnHLM. We focus first on the Flat Logical Model (FLM). The FLM is characterized by a single two-dimensional array of data elements. In this approach, the fact data and its associated dimensions are transformed into a key-value data table. The structure of this key-value table is denoted as T(Id_T, Att_T, R_T). - Id_T represents the fact key, uniquely identifying each fact record in the table. - Att_T includes simple attributes, which can be foreign keys referencing dimension keys, or measures. - R_T stores nested records that contain dimension data elements. In simpler terms, FLM is a straightforward way of organizing data in a key-value NoSQL database. All relevant data is stored in a single table, which allows for easy querying and retrieval of information. This approach eliminates the need for complex joins, which simplifies query optimization. By adopting the FLM approach, organizations can benefit from improved query performance and simplified data structure management. Overall, FLM is a strong choice for designing a big data mart under the key-value NoSQL model.
  • #10 Let's begin with the Star Hierarchical Logical Model (SHLM), which uses a tree-like structure based on parent/child relationships. SHLM organizes the fact and dimension tables into a hierarchical structure, resembling a tree. In this approach, the fact table is represented as a key-value pair, with the fact key and its corresponding measures. Each dimension is also mapped to a child key-value pair, inheriting the key of its matching fact record and containing dimension attributes. In this example you can see that the Supplier dimension is represented as a separate key-value pair, connected to the LineOrder fact table through a parent/child relationship. The LineOrder.Supplier key-value pair holds the supplier information related to the LineOrder record. The Snow Hierarchical Logical Model (SnHLM) builds on the SHLM. It enables more intricate relationships between dimensions and supports more advanced data aggregation. In SnHLM, dimensions are connected to multiple other dimensions, creating complex relationships. This connectivity allows for a more flexible representation of the data. In the SnHLM example, the Supplier dimension is extended to the Nation dimension. This is achieved by replicating the Supplier child table for each parent fact table, creating a one-to-many relationship. This approach allows for even more complex relationships and data organization, but it has the disadvantage that it may increase disk space usage because dimension tables are duplicated.
  • #11 Okay let's dive into the Implementation part.
  • #12 To evaluate query performance and storage consumption in a NoSQL key-value database, the study describes two experiments. Their main objective was to validate the proposed approaches for creating a big data mart with key-value NoSQL databases. Experiment 1: Measuring Read Request Latency. The first experiment evaluated read request latency for various dimensions. It aimed to understand how quickly the database responds to read requests and how the different logical models perform in this context. Experiment 2: Evaluating Storage Space per Logical Model. The second experiment checked how much storage space each logical model requires at different scale factors. This showed how data size affects storage needs for each model. These tests revealed information about the system's query performance and storage effectiveness. To evaluate the models, three analytical queries are used. These queries gradually include more dimensions in their computation.
  • #13 This slide focuses on Oracle NoSQL Database, a key component of the experiments in implementing the big data mart. Oracle NoSQL Database is a versatile key-value database offering a range of features for data management. Support for Various Data Types: One of its strengths is its support for diverse data types. It can handle JSON data, SQL-like table structures, and traditional key-value formats. The database also stands out for its ability to perform parent-child joins and execute aggregation operations. These capabilities enable the study to analyze complex relationships and aggregate data efficiently. Accessible Editions: For these experiments, the study chose the Community Edition (CE) of Oracle NoSQL Database.
  • #14 Moving on to the environment setting, which played a crucial role in the experiments for implementing the big data mart. A well-structured environment is essential to ensure reliable and accurate results. Data Generation: To evaluate how well the data modeling performs, the study used the well-known TPC-H benchmark[3]. To generate the necessary data, the study relied on KoalaBench, modified to suit its specific metadata model. The study uses DBGen, an easily accessible tool available on GitHub, to generate the data. To import the JSON files into Oracle NoSQL Database, an import function mentioned in the README file[4] was used. Software Settings: The experiments ran Oracle NoSQL Database within the Docker engine. Two environments were set up for comparison: a single-node setup and an expanded 3x1 cluster, both orchestrated using Docker Swarm. The host machine was equipped with an Intel Xeon W3530 processor and 8 GB of RAM.
  • #15 Let's go over the results from the experiments on implementing the big data mart using Oracle NoSQL Database. The experiments focused on two aspects: query execution time and the storage space allocated for each data model. Experiment 1: Query Execution Time. This experiment studied how quickly the database responded to queries on a 3-node cluster, using scale factors sf=1 and sf=10. It then evaluated the performance of each data model using HiveQL, a query language widely used in Hadoop-based systems. Observations: The Flat Logical Model (FLM) performed best, mainly because it does not require joins for aggregations. The Snow Hierarchical Logical Model (SnHLM) showed lower query performance than the Star Hierarchical Logical Model (SHLM), especially at larger scale factors. Experiment 2: Storage Space. This experiment investigated how much storage space each data model required in Oracle NoSQL Database. The study monitored disk usage and storage directory size. Observations: Fig. 5 shows the disk space consumed by each logical model at two scale factors (sf=1, sf=10). SnHLM occupies about three times more disk space than SHLM and FLM, primarily due to high data redundancy. The FLM offered optimal disk space utilization, but it requires more maintenance effort because multiple attributes are kept in one table.
  • #16 In conclusion, we studied the implementation of a big data mart on a NoSQL database, exploring three logical models: FLM, SHLM, and SnHLM. Key Findings: Evaluated on both storage space and query performance, the Flat Logical Model demonstrated better overall performance. Looking ahead, future work involves: a comparative study with relational databases and other NoSQL models to identify the best fit for the use cases discussed in the study; and exploring the transformation process from relational to key-value databases for seamless integration.