International Journal of Computer Science & Information Technology (IJCSIT) Vol 11, No 5, October 2019
DOI: 10.5121/ijcsit.2019.11506
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
Manoj Muniswamaiah, Tilak Agerwala and Charles Tappert
Seidenberg School of CSIS, Pace University, White Plains, New York
ABSTRACT
Organizations adopt different databases for big data, which is huge in volume and has different data models. Querying big data is challenging yet crucial for any business. Data warehouses traditionally built with On-line Transaction Processing (OLTP) centric technologies must be modernized to scale to the ever-growing demand of data. With rapidly changing requirements, it is important to get near real-time responses from the big data gathered so that business decisions needed to address new challenges can be made in a timely manner. The main focus of our research is to improve the performance of query execution for big data.
KEYWORDS
Databases, Big data, Optimization, Analytical Query, Data Analysts and Data Scientists.
1. INTRODUCTION
Big data analytics is the process of gathering and analyzing data that is immense in volume, variety and velocity in order to make informed business decisions and take appropriate actions. Different databases with varied data models are used to store and query big data: columnar databases serve read-heavy analytical queries; online transaction processing databases provide quick writes and consistency; NoSQL data stores offer high read and write scalability and are used for parallel processing of large volumes of data and for horizontal distributed scaling; HTAP is a single database that can perform both OLTP and OLAP by utilizing in-memory computation, a hybrid of an OLTP and a columnar database that can also be used for machine learning applications [1].
Various data stores are designed and developed to handle specific needs of big data and to optimize performance. Relational databases can store and process structured data efficiently, but they lack the scalability and elasticity needed to handle the inflow of big data, and their performance also degrades under read-heavy queries. Columnar databases, in contrast, facilitate faster reads, which makes them a better choice for analytical query processing. NoSQL data stores provide horizontal scaling, making it easy to add more resources on demand, and are specialized to handle unstructured data. The processed data is then stored in different databases to be used by analysts for their business needs [1].
Performance optimization and different data models are important for data-intensive applications. The main challenge is to build data pipelines that are scalable, interactive, efficient and fault tolerant. Data engineers optimize and maintain these data pipelines to ensure the smooth data flow that is crucial for application performance. A data engineer's primary job is to partition, normalize and index the base tables in order to improve the performance of the queries that feed dashboards and reports. These optimized copies are stored in different databases, chosen according to their data models, for querying and quicker response. In this research we want to automate this process: when a data scientist issues a query against any desired database, the query framework should detect the optimized copies created and stored by data engineers and execute the query against the optimized copy rather than the base table, which would improve the performance of the query response.
2. BACKGROUND
There are several techniques to improve query performance and response time, such as partitioning the base table, indexing the required columns, materialization and creating OLAP cubes. Indexing enables faster reads and retrieval of data; much like a data dictionary lookup, a B-tree index keeps the data sorted and allows sequential access to it [2]. Consistency and freshness of the optimized copies are maintained by updating them whenever the base table changes.
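As a concrete illustration of these techniques, the following minimal sketch creates an index and a materialized aggregate over a hypothetical trips table through JDBC. The PostgreSQL URL, credentials, and table and column names are assumptions made for the example, not part of the framework described in this paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OptimizedCopyBuilder {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; any JDBC-accessible RDBMS would do.
        String url = "jdbc:postgresql://localhost:5432/warehouse";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
             Statement stmt = conn.createStatement()) {
            // B-tree index: keeps pickup_date sorted for fast lookups and range scans.
            stmt.execute("CREATE INDEX IF NOT EXISTS idx_trips_pickup ON trips (pickup_date)");
            // Materialized aggregate: a pre-computed optimized copy for analytical queries.
            stmt.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS trips_daily AS "
                    + "SELECT pickup_date, COUNT(*) AS trip_count, SUM(total_amount) AS revenue "
                    + "FROM trips GROUP BY pickup_date");
            // Refreshing whenever the base table changes keeps the copy consistent and fresh.
            stmt.execute("REFRESH MATERIALIZED VIEW trips_daily");
        }
    }
}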
BigDAWG is a polystore framework implemented to support multiple databases compatible with different data models. Its main features are location independence, which routes queries to the desired databases, and semantic completeness, which lets queries make effective use of native database features. Shims translate incoming queries into their respective native database queries to take advantage of those features. Another important feature of BigDAWG is casting, where data from one island is converted into a model suitable for another island to enable inter-island queries and faster data migration between the underlying databases. The optimizer component parses the request, validates the syntax and creates a query plan to be executed by the engines. The monitor component uses statistics gathered from previous queries to pick the best query plan for execution. The migrator component is responsible for moving data between the data engines [3].
Apache Kylin is an open source distributed online analytical processing engine developed by eBay Inc. It is built to support both ROLAP and MOLAP analytics and provides a SQL interface with sub-second query latency for large datasets, which boosts query performance. OLAP cubes are built offline using a MapReduce process: Kylin reads data from the source table and then uses MapReduce to build cuboids of all combinations at each level. The cubes are generally stored in a columnar database for faster reads, and multiple cubing algorithms are used in the implementation. Recently, Apache Spark has been used to speed up the cube build process, with the result stored in the HBase data store. When users issue a query, it is routed to the prebuilt cubes for a quicker response; if the desired cube does not exist, the query is executed against the Hadoop data [4].
Myria is an analytical engine developed in academia and provided as a cloud service. MyriaX, its shared-nothing relational analytical query execution engine, executes queries efficiently. It supports federated analytics and eases big data analysis. MyriaL is the query language supported by Myria [5].
Apache Hive supports ad-hoc queries and is used for batch processing of big data. It converts SQL-like queries into MapReduce jobs and executes them. HiveQL is the query language used for processing the data [6].
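As an illustration of how such queries are issued in practice, the sketch below submits a HiveQL aggregate through Hive's JDBC interface. The HiveServer2 address, credentials, and the taxi_trips table and its columns are assumptions made for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAdHocQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; requires the hive-jdbc driver on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this HiveQL statement into MapReduce (or Tez/Spark) jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT vendor_id, COUNT(*) AS trip_count "
                   + "FROM taxi_trips GROUP BY vendor_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}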
Kodiak is a distributed analytical data platform that uses materialized views to process analytical queries. Materialized views at various levels are created and stored over petabytes of data, and the platform maintains these views efficiently and updates them automatically. With Kodiak, query execution was reported to be three times faster while using fewer resources [7].
Apache Lens is another open source framework that aims to provide a unified analytical layer over the Hadoop and database ecosystems, using a REST server and the CubeQL query language to store and process data cubes [8].
Analytical queries whose aggregates have been stored as OLAP cubes are used in reports and dashboards, and these cubes often need to be updated with the latest aggregates. Multiple databases are used within an organization for various tasks to store and retrieve data. Data engineers implement data pipelines to optimize datasets that are later used by data analysts and data scientists for research. Creating optimized copies involves partitioning and indexing the base tables; these optimized copies are later stored in different databases for querying.
Our research focus and solution is to implement a query framework that routes data analysts' and data scientists' queries to the optimized copies created by data engineers, which are stored, maintained and updated automatically, in order to achieve better query performance and reduce response time.
3. QUERY OPTIMIZER
Apache Calcite is an open source database query framework that uses relational algebra for query processing. It parses the incoming query and converts it into a logical plan, then applies various transformations to turn this logical plan into an optimized plan with a low execution cost. The optimized logical plan is converted, using traits, into a physical plan to be executed against the databases. The query optimizer eliminates logical plans that increase the cost of query execution according to the defined cost model. The Apache Calcite schema contains details about the data formats present in the model, which the schema factory uses to create the schema [9].
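To make the parse, validate and plan steps concrete, here is a minimal sketch using Calcite's planner API. The in-memory taxi schema, the trips table and its columns are hypothetical and stand in for whatever schemas the framework actually registers.

import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class CalcitePlanDemo {
    // Hypothetical in-memory table exposed to Calcite through a ReflectiveSchema.
    public static class Trip {
        public final int passengerCount;
        public final double fareAmount;
        public Trip(int passengerCount, double fareAmount) {
            this.passengerCount = passengerCount;
            this.fareAmount = fareAmount;
        }
    }

    public static class TaxiSchema {
        public final Trip[] trips = {
            new Trip(1, 9.5), new Trip(2, 14.0), new Trip(1, 7.25)
        };
    }

    public static void main(String[] args) throws Exception {
        SchemaPlus rootSchema = Frameworks.createRootSchema(true);
        rootSchema.add("taxi", new ReflectiveSchema(new TaxiSchema()));
        FrameworkConfig config = Frameworks.newConfigBuilder()
                .defaultSchema(rootSchema.getSubSchema("taxi"))
                .build();
        Planner planner = Frameworks.getPlanner(config);

        // Parse the SQL text, validate it against the schema, then convert it to
        // relational algebra; the logical plan is what the optimizer rules rewrite.
        String sql = "select \"passengerCount\", count(*) as \"trip_count\" "
                   + "from \"trips\" group by \"passengerCount\"";
        SqlNode parsed = planner.parse(sql);
        SqlNode validated = planner.validate(parsed);
        RelRoot root = planner.rel(validated);
        System.out.println(RelOptUtil.toString(root.rel));
    }
}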
The query optimizer applies planner rules to the relational nodes and generates alternative plans with reduced cost while retaining the original semantics of the query. When a rule matches a pattern, the query optimizer executes the transformation by substituting the subtree in the relational expression while preserving its semantics. These transformations are specific to each database. The metadata component provides the query optimizer with the overall cost of executing the relational expression and the degree of parallelism that can be achieved during execution [9].
The query optimizer uses a cost-based dynamic programming algorithm which fires rules in order to reduce the cost of the relational expression. This process continues until the cost of the relational expression no longer improves. The query optimizer takes into account CPU cycles, memory usage, and I/O resource utilization when costing the relational expression [9].
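For illustration, the short sketch below shows one way the planner's cost estimates can be inspected through Calcite's metadata layer; it assumes a RelNode obtained as in the earlier parsing sketch, and the figures depend on the statistics the metadata providers can supply.

import org.apache.calcite.plan.RelOptCost;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

public final class PlanCostInspector {
  // Prints the estimates the cost model compares; `rel` is assumed to be a
  // logical plan produced by the planner as in the previous sketch.
  public static void printEstimates(RelNode rel) {
    RelMetadataQuery mq = rel.getCluster().getMetadataQuery();
    RelOptCost cost = mq.getCumulativeCost(rel); // combines row count, CPU and I/O components
    Double rows = mq.getRowCount(rel);           // estimated output cardinality
    System.out.println("estimated rows: " + rows + ", cumulative cost: " + cost);
  }
}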
One technique used to improve query processing is to exploit the optimized copies created by data engineers. The query optimizer needs the ability to make use of these optimized copies when rewriting incoming queries. The optimizer does this by substituting part of the relational expression tree with an optimized copy, which it then uses to execute the query and return the response.
Algorithm
Optimization of a relational expression R (a code sketch of these steps follows the list):
1. Register the relational expression R.
a. Check for the existence of an appropriate optimized copy which can be used to substitute
the relational expression R.
b. If the optimized copy exists, register the new relational expression R1 over the optimized copy.
c. Trigger the transformation rules on relational expression R1 for cost optimization.
d. Obtain the best relational expression R1 based on cost and execute it against the
database.
2. If the relational expression R cannot be substituted with an existing optimized copy:
a. Trigger the transformation rules on relational expression R.
b. Obtain the best relational expression based on cost and execute it against the database [10].
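The following self-contained sketch mirrors the control flow above. OptimizedCopy, RelExpr and CostBasedPlanner are hypothetical placeholder types introduced only for illustration; they are not part of the Apache Calcite API.

import java.util.List;
import java.util.Optional;

/** Hypothetical placeholder for a relational expression. */
interface RelExpr {}

/** Hypothetical placeholder for a registered optimized copy. */
interface OptimizedCopy {
  boolean matches(RelExpr r);      // can this copy answer R?
  RelExpr rewriteOver(RelExpr r);  // R1: the same query expressed over the copy
}

/** Hypothetical placeholder for the rule-driven, cost-based planner. */
interface CostBasedPlanner {
  RelExpr applyRulesAndPickCheapest(RelExpr r);  // fire rules, keep the cheapest plan
}

final class SubstitutionSketch {
  static RelExpr optimize(RelExpr r, List<OptimizedCopy> copies, CostBasedPlanner planner) {
    // Step 1a: look for an optimized copy that can substitute R.
    Optional<OptimizedCopy> match = copies.stream().filter(c -> c.matches(r)).findFirst();
    if (match.isPresent()) {
      // Steps 1b-1d: register R1 over the copy, fire the rules, keep the cheapest plan.
      RelExpr r1 = match.get().rewriteOver(r);
      return planner.applyRulesAndPickCheapest(r1);
    }
    // Steps 2a-2b: no copy matched, so optimize the original expression R.
    return planner.applyRulesAndPickCheapest(r);
  }
}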
4. ARCHITECTURE
Figure 1. Architecture of the SQL Framework
The analytical queries being executed are parsed and different logical plans are generated. The query optimizer automatically determines whether the parsed relational expression can be substituted with the registered optimized copies created by the data engineers. It executes various rules to obtain the relational expression with minimal cost. The query is then rewritten to execute against the optimized copy and the results are returned [10].
SQL-Tool: Any tool which can be integrated over a JDBC connection to execute analytical SQL queries (a connection sketch follows this list).
Query Engine: The Apache Calcite open source framework, with its query optimizer extended to make use of optimized copies.
Metadata: Contains information about the schema and the features of the native databases.
Optimized copy: Optimized tables created by the data engineers.
RDBMS: Any relational database used to store structured data.
HTAP: A hybrid transactional/analytical database which offers the scalability of NoSQL data stores.
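As an illustration of the SQL-Tool integration point, the sketch below opens a JDBC connection to the query engine through Calcite's JDBC driver and runs an analytical query. The model file path, table and column names are hypothetical and would be defined by the actual deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class SqlToolConnectionSketch {
  public static void main(String[] args) throws Exception {
    // Calcite's JDBC driver; the model file (hypothetical path) describes the schemas,
    // native databases and registered optimized copies visible to the query engine.
    Class.forName("org.apache.calcite.jdbc.Driver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:calcite:model=conf/taxi-model.json");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT payment_type, AVG(total_amount) AS avg_fare "
                 + "FROM trips GROUP BY payment_type")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
      }
    }
  }
}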
5. EVALUATION
Analytical queries support the data-driven decision making process. In this paper we used the NYC Taxi and Limousine Commission (TLC) datasets, provided under the authorization of the Taxicab & Livery Passenger Enhancement Programs and consisting of the green and yellow taxi data [11]. The following experimental setup was used to benchmark and evaluate the query optimizer's ability to find the optimized copy of the data and substitute it into the query plan at run time. Data engineers usually store these optimized copies in a columnar database for faster read access. The tables and data used for the query evaluation were obtained from the NYC taxi dataset [11]. Data was stored in MySQL [12], Splice Machine [13], and the Vertica database [14].
The evaluation was based on the following setup:
a.) Base tables had rows ranging from ~10,000,000 to ~20,000,000.
b.) Optimized copy tables had rows ranging from ~1,000 to ~4,000,000.
c.) MySQL, Splice Machine and Vertica were each running on a single-node instance with an Intel Quad Core Xeon 3.33 GHz CPU, 24 GB RAM and a 1 TB HDD.
d.) Base table data was stored in HDFS [15].
e.) Optimized copies were stored in the Vertica database.
f.) The SQL query optimizer used a cost-based dynamic programming algorithm to substitute the optimized copy into the query plan (a timing sketch follows this list).
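To give a sense of how the response times were collected, the sketch below times one of the analytical aggregate queries through the same JDBC connection as in the architecture sketch; the query, table and column names are hypothetical stand-ins for the NYC taxi schema, and whether the query hits the base table or the optimized copy is decided by the optimizer.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class ResponseTimeSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.calcite.jdbc.Driver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:calcite:model=conf/taxi-model.json");
         Statement stmt = conn.createStatement()) {
      String query = "SELECT passenger_count, SUM(trip_distance) AS total_distance "
          + "FROM trips GROUP BY passenger_count";  // hypothetical analytical query
      long start = System.nanoTime();
      try (ResultSet rs = stmt.executeQuery(query)) {
        while (rs.next()) {
          // Drain the result set so the measurement covers the full response.
        }
      }
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println("query response time: " + elapsedMs + " ms");
    }
  }
}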
Figure 2. Query Response Time
The two bar graphs show the analytical queries executed by the query optimizer against the base table and against the optimized copy. The query optimizer successfully substituted the optimized copy into the query plan when one existed, improving the performance of the query and its response time.
6. CONCLUSION
In this research we extended a cost-based query optimizer to make use of optimized copies and thereby obtain improved query response time and performance. Various analytical queries were executed to measure the results. The query optimizer was able to substitute the optimized copies at runtime, when they existed, and rewrite the incoming query to execute against them. Future work includes extending this query optimizer to cloud databases.
REFERENCES
[1] Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., et al. (2015). The BigDAWG Polystore System. ACM SIGMOD Record, 44(3).
[2] V. Srinivasan and M. Carey. Performance of B-Tree Concurrency Control Algorithms. In Proc. ACM SIGMOD Conf., pages 416–425, 1991.
[3] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska et al., "A demonstration of the BigDAWG polystore system," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1908–1911, 2015.
[4] http://kylin.apache.org
[5] D. Halperin et al. Demonstration of the myria big data management service. In SIGMOD, pages
881–884, 2014.
[6] Fuad, A., Erwin, A. and Ipung, H.P., 2014, September. Processing performance on Apache Pig,
Apache Hive and MySQL cluster. In Information, Communication Technology and System (ICTS),
2014 International Conference on (pp. 297-302). IEEE.
[7] Liu, Shaosu, et al. "Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data." Proceedings of the VLDB Endowment 9.13 (2016): 1269-1280.
[8] https://lens.apache.org/
[9] https://calcite.apache.org/
[10] Muniswamaiah, Manoj & Agerwala, Tilak & Tappert, Charles. (2019). Query Performance
Optimization in Databases for Big Data. 85-90. 10.5121/csit.2019.90908.
[11] https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
[12] Luke Welling, Laura Thomson, PHP and MySQL Web Development, Sams, Indianapolis, IN, 2001
[13] https://www.splicemachine.com/
[14] C. Bear, A. Lamb, and N. Tran. The Vertica database: SQL RDBMS for managing big data. In Proceedings of the 2012 Workshop on Management of Big Data Systems, pages 37–38. ACM, 2012.
[15] Cong Jin, Shuang Ran, "The research for storage scheme based on Hadoop", Computer and
Communications (ICCC) 2015 IEEE International Conference on, pp. 62-66, 2015.