This document discusses moving from batch processing of data using MapReduce jobs to real-time, event-driven collection of analytical data. It analyzes data sets from a company called Kivra to determine whether a traditional SQL database could handle the scale needed. Tests show that the technique of using an asynchronous logging system to insert real-time event data into a SQL database can efficiently handle over 1 million insertions per hour per database node while still providing historical and aggregated data for analysis. The resulting system offers better scalability than the previous batch-based approach.
1. IT 15 025
Examensarbete 15 hp (degree project, 15 credits)
March 2015
NoSQL: Moving from MapReduce Batch Jobs to Event-Driven Data Collection
Lukas Klingsbo
Department of Information Technology (Institutionen för informationsteknologi)
3. Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet)
UTH unit (UTH-enheten)
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student
Abstract
NoSQL: Moving from MapReduce Batch Jobs to Event-Driven Data Collection
Lukas Klingsbo

Collecting and analysing data of analytical value is important for many service providers today. Many of them use NoSQL databases for their larger software systems; what is less well known is how to effectively analyse that data and gather business intelligence from it. This paper suggests a method of separating the most analytically valuable data from the rest in real time, while at the same time providing the analyser with an effective traditional database.

In this paper we analyse our given data sets to decide whether big data tools are required, and we then compare traditional databases to see how well they fit the context. A technique that makes use of an asynchronous logging system is used to insert the data from the main system into the dedicated analytical database.

The tests show that our technique can be used efficiently with a traditional database even on large data sets (>1,000,000 insertions/hour per database node) and still provide both historical data and aggregate functions for the analyser.

Printed by: Reprocentralen ITC
IT 15 025
Examiner: Olle Gällmo
Subject reviewer: Kjell Orsborn
Supervisor: Bip Thelin
7. 1 Introduction to the "Big Data/NoSQL problem"

In times when big data collection is growing rapidly [1], companies and researchers alike face the problem of analysing the data that has been collected. The enormous amounts of data collected are often saved in distributed big data databases or key-value stores with fast insertion operations but with comparatively slow and complex (although highly parallelizable) query functions [2]. This work tries to solve such a problem at Kivra [3], a company working to digitalise and minimise the use and need of conventional paper mail. Gathering analytical data from Kivra's distributed database today requires heavy MapReduce queries. Day-to-day analysis of the data slowly becomes unbearable and non-scalable, as it takes several hours or even days to run a MapReduce [4] batch job to fetch the data that is to be analysed. The solution proposed and examined by this paper is to structure and asynchronously collect data of high analytical value, as the interesting events occur, into a smaller and more easily queryable database.
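To make the proposed approach concrete, the sketch below shows the general shape of such an event-driven collection pipeline in Python. The language, the emit_event/analytics_writer names and the events table are illustrative assumptions made for this sketch only, not part of Kivra's actual system, which is described later in the paper.

```python
import queue
import sqlite3
import threading

EVENT_QUEUE = queue.Queue()

def emit_event(event_type, user_id, payload):
    """Called from the main system when an interesting event occurs.
    Returns immediately; the insert happens asynchronously."""
    EVENT_QUEUE.put((event_type, user_id, payload))

def analytics_writer(db_path="analytics.db"):
    """Background worker that drains the queue into a small, easily
    queryable SQL database dedicated to analytics."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " event_type TEXT, user_id TEXT, payload TEXT,"
        " created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
    )
    while True:
        event = EVENT_QUEUE.get()
        db.execute(
            "INSERT INTO events (event_type, user_id, payload) VALUES (?, ?, ?)",
            event,
        )
        db.commit()
        EVENT_QUEUE.task_done()

threading.Thread(target=analytics_writer, daemon=True).start()
emit_event("document_opened", "user-42", '{"document_id": "doc-1"}')
EVENT_QUEUE.join()   # only for this demo: wait until the event has been written
```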
8. 2 Related Terminology

2.1 Big data
Big data [5] is a loose term for ways of collecting and storing unstructured, complex data of such large volumes that normal methods of data manipulation and querying cannot be used.

2.2 NoSQL
NoSQL [6] is a term whose meaning has not yet fully settled. It started out meaning non-relational databases [6], but it is now commonly read as "Not Only SQL". If a database calls itself NoSQL it usually means that it is non-relational, distributed, horizontally scalable, schema-free and eventually consistent, and that it supports large amounts of data [7].

2.3 MapReduce
MapReduce [4] refers to a programming model that makes parallel processing of large data sets possible. In simplified terms it consists of two functions, one called Map and one called Reduce. The Map function is sent out from a master node to each node and executed separately on each node's data set; the intermediate results are then grouped so that all values belonging to the same key end up together. The Reduce function is then executed, either in parallel or on a master node, on each grouped result list containing the data from all the nodes that ran the Map function, and it combines this input into the answer that was originally asked for.
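As a concrete illustration of the model, the following is a minimal, single-process Python sketch of the canonical word-count job; the explicit shuffle step stands in for the grouping that a real MapReduce framework performs between the two phases.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each "node" turns its local lines into (key, value) pairs.
    for line in chunk:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all intermediate values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine the grouped values for each key into the final answer.
    return {key: sum(values) for key, values in grouped.items()}

chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]  # one chunk per node
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(pairs)))  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```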
2.4 Eventual consistency
Eventual consistency is a term used to describe how data is processed in database systems. It means that the data may not be consistent at any one moment, but will eventually converge to a consistent state [8]. This makes it possible for distributed databases to operate faster, as data does not have to be locked during insertions, updates and deletions, and changes therefore do not have to be propagated to all nodes before further actions on the same data can be sent to a node. If one or more nodes operate on the same data as another node whose changes have not yet propagated through the system, the system has to decide in which order the operations should be applied; this is usually called sibling or conflict resolution [9].
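A very common, if simplistic, resolution strategy is last-write-wins, where the sibling with the newest timestamp is kept. The sketch below only illustrates the idea; the data layout is hypothetical, and real systems such as Riak offer more elaborate strategies.

def resolve_siblings(siblings):
    # Last-write-wins: keep the version with the newest timestamp.
    return max(siblings, key=lambda s: s["timestamp"])

siblings = [
    {"accepted": True,  "timestamp": "2014-06-25T10:00:00.000001"},
    {"accepted": False, "timestamp": "2014-06-25T10:00:00.000002"},
]
print(resolve_siblings(siblings))  # the later write wins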
2.5 Riak
Riak [10] is a key-value datastore that implements NoSQL concepts such as eventual consistency. It is usually configured to run with at least 5 nodes to be as efficient as possible. To query the datastore you use either the RESTful API or the MapReduce functionality that it provides.
Riak provides fault tolerance and high availability by replicating data to several nodes; the default is to replicate all data to three nodes.
2.6 Lager
Basho [11], the company that developed Riak [10], has developed a modular logging framework called Lager [12]. Lager is modular in the sense that it is possible to write separate backends, each of which can collect data asynchronously. As the performance of the main system must not be noticeably affected by adding a subsystem that collects analytically valuable events, it is of the highest priority that events can be sent asynchronously, so that introducing such a subsystem does not result in delays for the main system.
A problem that this form of logging system can experience is that events queue up faster than they can be processed. To solve this, Lager [12] has built-in overload protection to make sure that its message queue will not overflow: it swaps between synchronous and asynchronous messaging for each separate backend.
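The overload-protection idea can be sketched in a few lines of Python: events are normally handed to a background worker through a bounded queue, and only when the queue is full does the caller fall back to waiting. This illustrates the principle only, not Lager's actual Erlang implementation.

import queue
import threading

class AsyncBackend:
    """Toy logging backend with Lager-style overload protection."""

    def __init__(self, store, max_pending=1000):
        self.store = store                      # e.g. a database insert function
        self.pending = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            self.store(self.pending.get())      # drain events in the background

    def log(self, event):
        try:
            self.pending.put_nowait(event)      # normal, asynchronous path
        except queue.Full:
            self.pending.put(event)             # overloaded: block the caller (synchronous)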
2.7 Abbreviations
2.7.1 RDBMS
RDBMS [13] is an abbreviation for relational database management system. This is the most commonly used kind of database management system and is usually referred to as a traditional database. Data is stored in tables, and data can be related across tables, which is why such databases are called relational.
2.7.2 DDBMS
DDBMS [14] is an abbreviation for distributed database management system. It is used to centrally manage a distributed database system and to give the user the impression and functionality of a database system running on a single node.
3 Background
3.1 About Kivra
Kivra [3] is a company which aims to digitalise and eliminate the use of superflu-
ous analogue paper mail. They do this by making it possible for companies and
governments to send documents digitally to customers or citizens. By tying the
users’ accounts to their social security numbers in the service the sender can be
sure that the document reaches the correct recipient.
Today Kivra has a continuously growing user base of more than 500 000 users, according to their website [3].
Figure 1: The user interface of Kivra
As can be seen in Figure 1, the interface is reminiscent of that of any email provider; however, the documents sent to the users can be interactive and have a more advanced layout, and the main advantage over normal email is that every user's identity is established in the registration process. The user can validate their identity either by using Mobile BankID [15] or by receiving an activation code at their registered place of residence. This advantage makes it possible for companies and organizations that otherwise would have had to send documents to a registered postal address to send them digitally through Kivra.
3.2 The current system
Kivra's users interact with the backend system (the backend box in Figure 2) either through Kivra's web page or their app, where they can receive documents, pay bills, set whom they accept documents from and also edit personal details. Users sign up either through Mobile BankID or by requesting an activation code sent to the address registered to their social security number; this is to ensure the identity of the user. When a user receives a new document they are notified through email or SMS and can then log in to read the document.
The data which is of analytical interest is information about the users, metadata of the documents sent out to the users, and whether a user accepts or does not accept documents (internally called send requests) from a company or organization (internally called a tenant).
3.3 Problem description
The part of the current system being examined consists of Riak, a distributed key-value datastore, and a batch script containing MapReduce queries which runs each night to collect data into a PostgreSQL database that is later used for deeper analysis. This system architecture can be seen in Figure 2. The batch script empties all of the analytically interesting buckets in the Riak cluster; currently two buckets are emptied every night. One of them contains send requests, which is metadata telling the system whether a user accepts documents from a specific tenant, and the other contains plain user information. In the future, metadata regarding documents sent to users will be collected as well.
This solution does not scale and is also inefficient, as all the data in the growing buckets of interest has to be traversed every night, instead of accumulating the data in a separate database dedicated to analysis at the same time as data is inserted into or changed in Riak.
Figure 2: The analytical system at Kivra
4 Methods for determining implementation details
This chapter introduces the different methods used to determine how the new
system should be implemented, which DBMS it should use and how the estimation
of long term scaling was done.
4.1 Data size and growth estimation
The data size and growth were estimated from statistics collected from the current Riak cluster in Kivra's system. What the insertion rate would look like with an event-driven implementation was estimated by running SQL queries measuring the number of insertions that could be made within certain time limits. To estimate the size and growth of the future database implementation, both past estimations and the company's current internal growth estimates were used. This, together with the scalability measurements, was required to make sure that the solution would be durable in the long run.
4.2 Scalability measurement
The scalability was estimated for the typical server hardware of Kivra with the
different databases and with statistics of growth estimations from the current
database. This was necessary to make sure that the system would scale with
the expected data growth, within a reasonable time frame.
4.3 DBMS comparison
The database management systems were compared by testing required capabilities and by examining load and speed through insertion jobs of different sizes. The DBMSs' architectures and suitability were also compared to see which one suited the context best. These comparisons were vital to make sure that the chosen DBMS could handle the expected load of the system.
5 Solution for storing events
When an event of analytical interest occurs, an asynchronous call with information about the event is sent to Lager. Lager then directs it to an in-house tailored Lager backend, which is responsible for making sure that the data is stored consistently and correctly. The database in which the data is stored can then be queried at any time to perform near real-time analysis of the system. After analysing internal database statistics, an approach of this form is estimated to cause approximately 100-10 000 inserts/hour depending on whether it is during peak hours or not. The estimate cannot be averaged, as the chosen database is required to handle the peak hours without a problem. This approach leads to a system with no updates, and reads only during analytical calculations. This is because the backend handles each event by simply inserting the relevant metadata of the event together with a timestamp, and never needs to update or remove old data, as all of the chosen events are valuable to store in order to enable historical queries. This is similar to how Lager [12] logs events to files, except that analysing the data becomes easier thanks to the built-in aggregate functions of modern databases. In comparison to the current system, it will be possible to perform near-instant analysis of the data already collected, instead of having to re-collect all the data of analytical interest from the Riak cluster each time an up-to-date analysis has to be performed.
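The following Python sketch shows the essence of such a backend: every incoming event is appended as a new row carrying its own timestamp, and nothing is ever updated or deleted. It is only an illustration of the insert-only idea; the database name, the column set and the use of psycopg2 are assumptions made for the example, not a description of Kivra's actual backend (which would be written as a Lager backend in Erlang).

import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical analytical database

def store_send_request_event(user_id, tenant_id, accepted):
    """Append one send-request change as a new, timestamped row.

    The table layout mirrors the 'Send request' example in Listing 1;
    existing rows are never updated, so the full history is preserved.
    """
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO sendrequest (user_id, tenant_id, accepted, timestamp) "
            "VALUES (%s, %s, %s, now())",
            (user_id, tenant_id, accepted),
        )

store_send_request_event(42, 7, True)   # user 42 starts accepting tenant 7
store_send_request_event(42, 7, False)  # later change: a second row, not an update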
5.1 Data sets
5.1.1 Determining the type of the data sets
Big data is one of the popular buzzwords in the computer industry today, and a lot of revolutionising technology has been built around it. But it is not always efficient to use technology built for such data sets if the data set being operated on does not fit the definition of big data. Because of the hype around extracting higher yield from existing data, some companies and institutions use this technology just for the sake of being part of the trend, without actually having a data set that would objectively be classified as big data [16].
To make the abstract definition of big data concrete enough to determine whether a data set qualifies, the data set can be classified by answering three questions [5], namely the following:
• Is the data set large?
• Is the data set complex (maybe even unstructured)?
• Are methods optimized for other data sets inefficient?
If the data set satisfies these properties it can be a good idea to store the data
in a Hadoop cluster or similar technology developed specifically for big data. If,
however, the data set does not satisfy these criteria it is usually better to look for
another storage solution [16].
At Kivra there are a few different data sets with information that is valuable
for ongoing analysis. The three most important data sets are:
• User information - each user's personal details.
• Send requests - connects a user to a tenant and specifies whether the user
wants or does not want content from that specific tenant.
• Content metadata - contains meta data of content that has been sent to
users.
The tables representing this data would be quite simple and would not store very much information, as analysis of the system would mostly concern metadata about some of the information stored in the Riak cluster. See listing 1 for an example of what such tables could look like.
Listing 1: Example tables
Example fields of the tables
+-------------+  +--------------+  +------------------+
| User        |  | Send request |  | Content Metadata |
+-------------+  +--------------+  +------------------+
| id          |  | user_id      |  | content_id       |
| name        |  | tenant_id    |  | tenant_id        |
| address     |  | accepted     |  | user_id          |
| city        |  | timestamp    |  | timestamp        |
| email       |  |              |  |                  |
| timestamp   |  |              |  |                  |
+-------------+  +--------------+  +------------------+
5.1.2 Size of the data sets
As each record in the data sets is fairly small (under 200 bytes) it is enough to count the number of records plus the expected growth in the future. Today there are about 500 000 users with information [3]. The send request records add up to a little less than the number of users times the number of tenants, and there are also around 2 million content records. This results in a total of about 10 million records. All of the data sets have an expected growth of about 5% per month according to internal statistics collected from Kivra's Riak cluster. This is well within the limits of traditional relational databases [17].
5.1.3 Complexity of the data sets
None of the data sets have any complex structures, the data types do not vary, and there are no large documents that need to be stored in the analytical database, as only the metadata of the content being sent to users has to be saved and not any attachments, images etc.
5.1.4 Inefficiency of other methods
There is no need for the extremely fast reads of specific values which key-value stores [18] offer, or for the extremely fast insertion rates which big data databases [5] can provide. As concluded in the section about the size of the data sets (Section 5.1.2) and the section on choosing a SQL database (Section 5.3.3), the insertion rate of traditional databases will be well above the demands. As there is no complex data, the normal table structure of an RDBMS will be suitable for storing the data.
5.2 Protocols for data transfer
The requirements on the protocol and data structure used to transfer and translate data to the database depend strongly on what type of data is being sent and from where. Some data requires absolute transport security and reliability, while other data is of lower importance and rather trades transport reliability for speed. To maximize the speed and flexibility of the system as a whole, one or several APIs and web services should be used that can accept several protocols and differently structured data and translate it before storing it in the database, as can be seen in Figure 3.
The exact APIs, web services and data structures used for the implementation of this system are outside the scope of this work and are left to a future implementer to decide. Examples of such protocols could be HTTP through a RESTful [19] service, cURL [20] with one of its many supported protocols, or simply whatever protocol is built into the customised Lager [12] backend.
Figure 3: API explanation
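Purely as an illustration of what one such entry point could look like, the sketch below accepts a JSON-encoded event over HTTP and stores it in the same append-only way as described above. The route, the payload fields and the use of Flask and psycopg2 are assumptions made for the example; as stated above, the actual choice of APIs and protocols is left to a future implementer.

from flask import Flask, request
import psycopg2

app = Flask(__name__)
conn = psycopg2.connect("dbname=analytics")  # hypothetical analytical database

@app.route("/events/sendrequest", methods=["POST"])
def receive_send_request():
    # Translate the incoming JSON payload into a timestamped row.
    event = request.get_json()
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO sendrequest (user_id, tenant_id, accepted, timestamp) "
            "VALUES (%s, %s, %s, now())",
            (event["user_id"], event["tenant_id"], event["accepted"]),
        )
    return "", 204  # no response body needed; the event is stored

if __name__ == "__main__":
    app.run()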
5.3 Database
5.3.1 Requirements
As the database will mostly be used for insertions, the insert operations will have to be as fast as possible given the chosen way of storing the data. Read operations will not have to be very fast, as they are not used directly by any response-time-critical system but rather by in-house users, a business intelligence system, or a system providing an interface of generated graphs and tables for getting an overview of the information.
The requirements on the speed of updates and deletes are non-essential, as these operations will only occur when manually editing the database. The database will not have to be portable; it is not going to be replicated or moved around. For future scaling the database needs to be easily configurable and should preferably have wide support for distributing the data over several nodes. The data is required to be accessible and easy to aggregate over for future analysis.
5.3.2 Why SQL over NoSQL?
A traditional relational database system was chosen over a NoSQL document store or key-value store for a few different reasons. The first reason was that the trade-off NoSQL databases make, accepting eventual consistency [21] in exchange for a high insertion rate, would simply not be needed, as the amount of data insertions and updates would be well within the limits of a traditional RDBMS [17], as can be seen in Section 6.1.
The second reason was that the query speed for specific values offered by key-value stores [18] was not required; the demand was rather for a full set of aggregate functions and easy access to complete subsets of the stored data. As many NoSQL databases use MapReduce for fetching data, it would by definition be hard to write efficient joins, and no built-in aggregate functions would be provided [22].
Also considered was putting a layer on top of a regular NoSQL database that seamlessly provides this type of functionality. The layers considered were Hive [23], Pig [24] and Scalding [25] for Hadoop [26]. But as the implementation did not need the special functionality and speed trade-offs offered by such layers, which is explained in more depth in Section 6.1 and Section 5.3.1, the conclusion was to simply use a traditional SQL database.
5.3.3 Choosing SQL database
There are a large number of different traditional database systems being actively developed today. For this project the focus has been on popular open source DBMSs. The most popular open source relational database management systems [27] today are MySQL, PostgreSQL and SQLite, so the tests compare those against each other and against the requirements.
MySQL
MySQL [28] is the most popular open source database and has several popular forks as well, the biggest [27] being MariaDB [29]. MySQL does not aim to implement the full SQL standard, but rather a subset of it with some extra features [27]. MySQL is built in layers, with one or several storage engines at the bottom and optimizer, parser, query cache and connection-handling layers built on top of that [30]; see Figure 4.
Figure 4: MySQL Architecture [30]
MySQL has a rich set of aggregate functions, which are very useful when doing statistical analysis of the stored data. In total, MySQL has 17 aggregate functions, which include the most common SQL standard aggregate functions AVG, COUNT, MAX, MIN, STD and SUM [31].
PostgreSQL
PostgreSQL is one of the few modern SQL databases that implement the whole SQL standard (ANSI SQL:2008) [32]. The architecture of PostgreSQL is quite straightforward, as can be seen in Figure 5. The postmaster handles new connections and spawns a PostgreSQL backend process that the client communicates with directly. All communication between the client and PostgreSQL goes through that process after it has spawned, while the postmaster stays active and awaits new incoming connections. Each backend that a client is connected to has its own separate parser, rewrite system, optimiser and executor. All of the active backends then use the same storage engine [33]. PostgreSQL, compared to MySQL, has only one very tightly integrated storage engine, which in theory could be replaced but in practice has not been [34].
Figure 5: PostgreSQL Architecture [33]
PostgreSQL has a comparatively large set of aggregate functions, including 14 general-purpose functions and 18 functions mainly used for statistics, totalling 32 functions [35].
SQLite
SQLite, as opposed to MySQL and PostgreSQL, is not built with a traditional two-tier architecture, as can be seen in Figure 6. Instead, each call to it is made through function calls to a linked SQLite library [36]. SQLite is quite small (658 KiB) and has therefore become widely used as an embedded database for local storage [37]. SQLite is bundled with two storage engines: one is built as a log-structured merge-tree and optimised for persistent databases, and the other is built on a binary tree kept in memory, which makes it suitable for temporary databases. The architecture of SQLite makes it possible to design new storage engines and interchange them with the built-in ones [38].
Figure 6: SQLite Architecture [37]
SQLite has 9 of the most basic aggregate functions and lacks the more statistical functionality, such as standard deviation and covariance, that PostgreSQL and MySQL implement [39]. SQLite's architecture is very focused on portability [37] and has made some trade-offs in order to achieve that, as can be seen in its simplistic architecture in Figure 6. One result of these trade-offs is that insertions become very slow, roughly 60 insertions per second when each insertion runs in its own transaction [40].
As portability is not required for the system considered in this paper, and because SQLite lacks many important aggregate functions and has a slow insertion rate, it is excluded from the tests in the next section.
5.3.4 Performance tests
As explained in Section 5.3.3, SQLite was excluded from the tests as its architecture was deemed unsuitable for the task at hand and because it lacks many aggregate functions. The statistics were generated on an average consumer computer with the following specifications:
CPU: Intel Core2 Duo T7700 2.40GHz
RAM: SODIMM DDR2 Synchronous 667 MHz 2x2GB
Disk: KINGSTON SV300S3 120GB SSD partitioned with EXT4
OS: Debian Jessie (Testing)
PostgreSQL version 9.4 and MySQL version 5.6 were used during the tests and
the packages were installed from the default Debian testing repositories.
The database and table were created in the same manner on both systems as can
be seen in listing 2.
Listing 2: Database and table creation
create database testdb;
create table test_table
  ( data varchar(100),
    moddate timestamp default current_timestamp );
The tests were conducted with a series of simple insertions containing random
data, as shown in listing 3.
Listing 3: Example insertion
insert into test_table ( data )
values ('010100100101010100...');
The execution times were measured with the time [41] utility, which measures the execution time and resource usage of the specified command. The files used in the tests contained 1000, 10 000, 100 000 and 1 000 000 insertions of random data. When deploying this system the production servers will most likely be a lot more powerful than the consumer laptop that these tests were run on. The tests nonetheless give a representative picture of how well the different databases scale. To reproduce or verify the results, see listings 5 and 6 in Appendix A.
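The insertion files themselves are not reproduced here; a small script along the following lines can generate equivalent ones. The file name and the length of the random bit string are assumptions chosen to match the example insertion in Listing 3.

import random

def write_insert_file(path, n_rows, width=100):
    """Write a SQL file with n_rows random-data INSERT statements."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            bits = "".join(random.choice("01") for _ in range(width))
            f.write(f"insert into test_table (data) values ('{bits}');\n")

write_insert_file("test4.sql", 1_000_000)  # the largest test file used above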
As can be seen in Figure 7, both systems show close to linear performance, and any small disturbances in the pattern can be attributed to other processes using system resources. Both systems perform above the preconditions for an implementation with requirements similar to those in this context. But as insertion rate and aggregate functions are of high concern for such an implementation, and both systems comply with the rest of the requirements, PostgreSQL, which showed the faster insertion times in these tests and offers the larger set of aggregate functions, is recommended for a future implementation.
Figure 7: Performance graph
5.4 Storage requirements
As the data comes in as events, there is never a need to update old records in the database, as can be seen in Figure 8. This structure is slightly harder to run complicated queries (Listing 4) against, as the timestamp of the event always has to be taken into consideration. But it also makes it possible to configure the DBMS not to lock tables during inserts, which will increase the insertion speed of the database. The lack of locks will be especially useful if it is later decided to use a DDBMS to meet an eventual requirement of scaling up the system. As everything is logged with timestamps, the data is particularly suitable for business intelligence systems and the like, as time ranges in the past as well as the current state of the system can be queried to gather analytics and make comparisons.
Figure 8: Flow of data
The only time inserts occur is when the customised Lager [12] backend, or later added data sources, sends events (1 in Figure 8), and the only times reads are performed are when a manual query (2 in Figure 8) is executed or when a business intelligence system or an automated analysis system (3 in Figure 8) queries the database. It will never be necessary to perform updates, and no regular deletes will have to be performed, as explained in Section 5.3.1.
The actual database size should not be a problem, as the servers are usually equipped with a few terabytes of disk space. An average insertion at Kivra is estimated at around 200 bytes, based on metadata formed by data currently in the Riak cluster. This means that 5 × 10^9 insertions would fit on a disk with 1 TB of space. A rough estimate of an average of 5000 insertions per hour means that 1 TB would last for more than 113 years, see Figure 9.

1. 1 TB / 200 B ≈ 5 × 10^9 insertions
2. 5 × 10^9 / 5000 / 24 / 365 ≈ 114 years

Figure 9: Estimation of disk space usage
6 Resulting system
Implementing the solution suggested in this paper would lead to a core system with a data flow as described in Figure 10. In this flow, when data of analytical interest is inserted into the Riak cluster it is also sent to Lager, which inserts it into an RDBMS dedicated to analytical data, instead of fetching the data from Riak after it has been inserted, as is done in the previous system's data flow, which can be seen in Figure 2 and is explained in Section 3.2.
Figure 10: Resulting system
What is not shown in Figure 10 is that separate data sources for the analytical database can be used. These can connect, with different protocols and differently structured data, to an API that receives data for the database, as described in Section 5.2.
6.1 Scalability
If an unoptimised PostgreSQL database on an average consumer computer is used, and assuming that little to no data transfer latency occurs during insertion, the insertion rate can grow by a factor of at least 100 from the upper bound of the estimate in Section 5 of 10 000 insertions per hour. As can be seen in the load test in Section 5.3.4, PostgreSQL can handle a little over 1 million insertions per hour. The expected data growth rate is 5% per month; with this growth the same setup could be used for almost 8 years, see Figure 11.

1. 10 000 × 1.05^94 ≈ 981 282 insertions/hour
2. 94 / 12 ≈ 7.83 years

Figure 11: Estimation of data transfer scaling
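The 94 months in Figure 11 follow directly from solving 10 000 × 1.05^m ≥ 1 000 000 for m, which a few lines of Python make explicit (the growth rate and the two rates are the ones given above):

import math

current_rate = 10_000          # estimated peak insertions/hour today
measured_capacity = 1_000_000  # insertions/hour handled in the load test
monthly_growth = 1.05

# Number of months until the growing insertion rate reaches the capacity.
months = math.log(measured_capacity / current_rate) / math.log(monthly_growth)
print(months, months / 12)  # ≈ 94.4 months, ≈ 7.9 years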
In the unlikely event that the same setup with the same hardware and PostgreSQL version were still in use in 8 years, the PostgreSQL database could easily be sharded, clustered or optimised to reach far higher insertion rates. As all the data is independent and has no defined relations to other data, clustering with a pool of independent databases could be used without any need for the databases to synchronise with each other; this means that roughly 1 000 000 insertions/hour can be added linearly for each database node added to the cluster.
6.2 Practical Example
6.2.1 Storing an Event
After an implementation of a solution like this at Kivra, the following would be the flow of a typical change triggered by a user. The system would update the send request bucket of Riak as in Figure 12, but changes to other buckets would look exactly the same.
Figure 12 is a practical example of what would happen if a user updated his or her preferences for which tenants to receive documents from.
Figure 12: Practical example of a send request changing
6.2.2 Querying Data
To query data stored in the analytical database, a query that takes the timestamps into consideration has to be performed. The query in Listing 4 is an example of how to find which tenants all users wish and do not wish to receive documents from. Queries for other data sets would look similar.
Listing 4: Example query
SELECT SRQ.user_id, SRQ.tenant_id, SRQ.timestamp, SRQ.accepted
FROM sendrequest SRQ
INNER JOIN
  (SELECT user_id, tenant_id, max(timestamp) AS latest
   FROM sendrequest GROUP BY user_id, tenant_id) SRQ1
ON SRQ.user_id = SRQ1.user_id
AND SRQ.tenant_id = SRQ1.tenant_id
AND SRQ.timestamp = SRQ1.latest;
The query in Listing 4 does not handle the case where the same user and tenant combination exists with the same timestamp. It is, however, highly unlikely that this would happen, as the timestamps are stored with microsecond precision. To rule it out entirely, add an auto-incremented ID column to the table and only consider the highest ID in the case of a clash.
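In PostgreSQL the same "latest row per user and tenant" query, with the ID column as a tiebreaker, can also be expressed with DISTINCT ON. The following is a sketch of how it could be run from Python, assuming the database name used earlier and that the id column mentioned above has been added.

import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical analytical database

LATEST_SEND_REQUESTS = """
    SELECT DISTINCT ON (user_id, tenant_id)
           user_id, tenant_id, timestamp, accepted
    FROM sendrequest
    ORDER BY user_id, tenant_id, timestamp DESC, id DESC
"""

with conn, conn.cursor() as cur:
    cur.execute(LATEST_SEND_REQUESTS)
    for user_id, tenant_id, ts, accepted in cur.fetchall():
        print(user_id, tenant_id, ts, accepted)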
7 Discussion
7.1 Event-driven data collection
The solution suggested in this paper has been named event-driven data collection. The name has been used before for somewhat similar purposes, the first recorded case being in 1974 [42]. It has also been used in a different manner in quite a few papers on the subject of wireless sensor networks [43] and in some regarding the collection of natural ionosphere scintillations [44].
7.2 Flat file database
A flat file database was not compared in this paper because of the precondition of having pre-built APIs and the need for easier access to querying and aggregation of data. A flat file database can also be complicated to scale out to several servers for an eventual future requirement.
8 Summary
8.1 Conclusions
The system that has been designed in this paper can be used as an effective tool for logging and analysing events. This kind of approach is especially valuable if the main system uses some form of MapReduce database, on which ongoing analysis of aggregated data can be hard to perform. If the main system already uses a SQL database as its primary form of information storage, this type of solution would be quite inefficient due to duplication of data and the possible need to operate another DBMS. In the benchmarking of the different database management systems, the standard versions without modifications were used; with modifications and optimisations the outcome could have been very different.
The unorthodox design of the database schema in this paper provides a full historical log of events, which makes it possible to compare data from different time ranges. This can be very valuable when analysing data from business systems, for example to detect trends and tendencies within the user base.
Another contribution of this work is the use of a logging system to effectively add an analytical subsystem to an already existing backend with minimal impact on the performance of the existing system.
8.2 Future work
To finalise the architecture of this system, APIs and web service structures will have to be researched or developed to be put on top of the database. For larger flows of data, other solutions would have to be considered, or alternatively the chosen DBMS would have to be heavily optimised, for example by removing insertion locks. Scaling out to a clustered database could also be an alternative.
Appendices
A Benchmarks
Listing 5: PostgreSQL insertion test
1 000 000 insertions
$ time psql --quiet -f test4.sql
psql --quiet -f test4.sql
40.42s user 23.08s system 2% cpu 31:05.03 total

100 000 insertions
$ time psql --quiet -f test.sql
psql --quiet -f test.sql
4.32s user 2.38s system 2% cpu 5:02.94 total

10 000 insertions
$ time psql --quiet -f test2.sql
psql --quiet -f test2.sql
0.51s user 0.22s system 2% cpu 30.172 total

1000 insertions
$ time psql --quiet -f test3.sql
psql --quiet -f test3.sql
0.11s user 0.02s system 4% cpu 3.077 total

Listing 6: MySQL insertion test
1 000 000 insertions
$ time mysql -u root -p********** testdb < test4.sql
mysql -u root -p********** testdb <
39.78s user 23.36s system 1% cpu 1:36:56.31 total

100 000 insertions
$ time mysql -u root -p********** testdb < test.sql
mysql -u root -p********** testdb <
3.88s user 2.83s system 1% cpu 9:22.67 total

10 000 insertions
$ time mysql -u root -p********** testdb < test2.sql
mysql -u root -p********** testdb <
0.31s user 0.39s system 1% cpu 53.174 total

1000 insertions
$ time mysql -u root -p********** testdb < test3.sql
mysql -u root -p********** testdb <
0.08s user 0.00s system 1% cpu 5.318 total
References
[1] "Data, data everywhere," The Economist, February 2010. http://www.economist.com/node/15557443, accessed 2014-06-22.
[2] Y. Liu, P. Dube, and S. C. Gray, "Run-time performance optimization of a bigdata query language," in Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, ICPE '14, (New York, NY, USA), pp. 239–246, ACM, 2014.
[3] "Om kivra," Kivra, June 2014. https://www.kivra.com/about/, accessed 2014-06-26.
[4] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107–113, Jan. 2008.
[5] C. Snijders, U. Matzat, and U.-D. Reips, "Big data: Big gaps of knowledge in the field of internet science," International Journal of Internet Science, vol. 7, no. 1, pp. 1–5, 2012.
[6] S. Tiwari, Professional NoSQL. John Wiley and Sons, 2011.
[7] "Nosql databases explained," MongoDB, Inc., February 2015. http://nosql-database.org/, accessed 2015-03-05.
[8] W. Vogels, "Eventually consistent," Commun. ACM, vol. 52, pp. 40–44, Jan. 2009.
[9] "Sibling resolution in riak," Richárd Jónás, November 2013. http://www.jonasrichard.com/2013/11/sibling-resolution-in-riak.html, accessed 2014-08-05.
[10] "What is riak?," Basho, June 2014. http://docs.basho.com/riak/latest/theory/why-riak/, accessed 2014-06-25.
[11] "Basho," Basho, June 2014. http://basho.com/, accessed 2014-06-28.
[12] "Introducing lager, a new logging framework for erlang/otp," Basho, July 2011. http://basho.com/introducing-lager-a-new-logging-framework-for-erlangotp/, accessed 2014-06-28.
[13] T. Bakuya and M. Matsui, "Relational database management system," Oct. 21 1997. US Patent 5,680,614.
[14] B. Beach and D. Platt, "Distributed database management system," Apr. 27 2004. US Patent 6,728,713.
[15] "Mobilt bankid," Finansiell ID-Teknik BID AB, April 2015. http://support.bankid.com/mobil, accessed 2015-04-06.
[16] "Don't use hadoop - your data isn't that big," Chris Stucchio, September 2013. http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html, accessed 2014-08-17.
[17] "Mysql performance study," Hyperic, April 2011. http://download.hyperic.com/pdf/Hyperic-WP-MySQL-Benchmark.pdf, accessed 2014-07-29.
[18] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, "Workload analysis of a large-scale key-value store," SIGMETRICS Perform. Eval. Rev., vol. 40, pp. 53–64, June 2012.
[19] L. Richardson and S. Ruby, RESTful web services. "O'Reilly Media, Inc.", 2008.
[20] "What is curl?," curl.haxx.se, September 2014. http://curl.haxx.se/docs/faq.html#What_is_cURL, accessed 2014-10-08.
[21] D. Bermbach and S. Tai, "Eventual consistency: How soon is eventual? an evaluation of amazon s3's consistency behavior," in Proceedings of the 6th Workshop on Middleware for Service Oriented Computing, p. 1, ACM, 2011.
[22] Z. Parker, S. Poe, and S. V. Vrbsky, "Comparing nosql mongodb to an sql db," in Proceedings of the 51st ACM Southeast Conference, ACMSE '13, (New York, NY, USA), pp. 5:1–5:6, ACM, 2013.
[23] "Apache hive tm," Apache, July 2014. https://hive.apache.org/, accessed 2014-08-05.
[24] "Apache pig," Apache, June 2014. http://pig.apache.org/, accessed 2014-08-05.
[25] A. Chalkiopoulos, "Programming mapreduce with scalding," Packt Publishing, June 2014.
[26] "What is apache hadoop?," Apache, August 2014. http://hadoop.apache.org/, accessed 2014-08-05.
[27] "Sqlite vs mysql vs postgresql: A comparison of relational database management systems," Digital Ocean, Inc., February 2014. https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems, accessed 2014-08-18.
[28] "About mysql," Oracle Corporation and/or its affiliates, January 2015. http://www.mysql.com/about, accessed 2015-03-05.
[29] "About mariadb," MariaDB Foundation, January 2015. https://mariadb.org/en/about/, accessed 2015-03-05.
[30] B. Schwartz, P. Zaitsev, and V. Tkachenko, High performance MySQL: Optimization, backups, and replication. "O'Reilly Media, Inc.", 2012.
[31] "Mysql 5.7 reference manual :: 12.16.1 group by (aggregate) functions," Oracle Corporation, March 2014. http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html, accessed 2014-08-30.
[32] "Postgresql - about," The PostgreSQL Global Development Group, August 2014. http://www.postgresql.org/about/, accessed 2014-12-14.
[33] "Postgresql 9.3.5 documentation - 46.1. the path of a query," The PostgreSQL Global Development Group, February 2014. http://www.postgresql.org/docs/9.3/static/query-path.html, accessed 2014-08-19.
[34] B. Momjian, PostgreSQL: introduction and concepts, vol. 192. Addison-Wesley, New York, 2001.
[35] "Postgresql 9.3.5 documentation - chapter 9," The PostgreSQL Global Development Group, July 2014. http://www.postgresql.org/docs/9.3/static/functions-aggregate.html, accessed 2014-08-30.
[36] C. Newman, SQLite (Developer's Library). Sams, 2004.
[37] "The architecture of sqlite," SQLite.org, July 2014. http://www.sqlite.org/arch.html, accessed 2014-08-19.
[38] "Sqlite4 - pluggable storage engine," SQLite.org, July 2014. http://sqlite.org/src4/doc/trunk/www/storage.wiki, accessed 2014-08-30.
[39] "Sql as understood by sqlite - aggregate functions," SQLite.org, June 2014. http://www.sqlite.org/lang_aggfunc.html, accessed 2014-08-30.
[40] "Sqlite frequently asked questions," SQLite.org, July 2014. http://www.sqlite.org/faq.html#q19, accessed 2015-03-10.
[41] "Time - linux programmer's manual," Linux man-pages project, September 2011. http://man7.org/linux/man-pages/man2/time.2.html, accessed 2014-08-28.
[42] J. D. Foley and J. W. McInroy, "An event-driven data collection and analysis facility for a two-computer network," SIGMETRICS Perform. Eval. Rev., vol. 3, pp. 106–120, Jan. 1974.
[43] C. Townsend and S. Arms, "Wireless sensor networks," MicroStrain, Inc, 2005.
[44] W. Pelgrum, Y. Morton, F. van Graas, P. Vikram, and S. Peng, "Multi-domain analysis of the impact of natural and man-made ionosphere scintillations on gnss signal propagation," in Proc. ION GNSS, pp. 617–625, 2011.