Making Interactive BI for Big Data a Reality
Achieving truly interactive response times on Hadoop and Amazon S3
Technical White Paper, April 2015
Table of Contents
Introduction
Part 1: Indexes for Big Data
Part 2: JethroData Architecture
Part 3: JethroData Performance Features
Summary
More Information
Introduction
This white paper explains how JethroData can help you achieve truly interactive response times for BI on big data, and how the underlying technology works. It starts by analyzing the challenges of implementing indexes for big data and how JethroData's indexes solve these challenges. It then discusses how the JethroData design of separating compute from storage works with Hadoop and with Amazon S3. Finally, it briefly discusses some of the main features behind JethroData's performance, including I/O, query planning and execution features.
Part 1: Indexes for Big Data
Why Use Indexes?
Indexes are a fundamental performance feature of any high-performance database. Indexes accelerate selective queries by orders of magnitude, as they allow the query engine to fetch and process only relevant data instead of scanning all the data in the tables.
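To make the contrast concrete, here is a small, purely illustrative Python sketch (not JethroData internals); the table and column names are invented for the example:

# Full scan: every row is read and tested, so cost grows with table size.
rows = [
    {"id": 1, "country": "FR", "amount": 120},
    {"id": 2, "country": "US", "amount": 75},
    {"id": 3, "country": "FR", "amount": 40},
]

def full_scan(table, country):
    return [r for r in table if r["country"] == country]

# Index lookup: a value -> row-position map lets the engine touch only
# the matching rows.
index = {}
for pos, r in enumerate(rows):
    index.setdefault(r["country"], []).append(pos)

def index_lookup(table, idx, country):
    return [table[pos] for pos in idx.get(country, [])]

assert full_scan(rows, "FR") == index_lookup(rows, index, "FR")

The more selective the filter, the larger the gap between the two approaches, which is why selective BI queries benefit most.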
In the past decade, indexes have been relegated to a minor role in high-performance analytic databases, for several reasons:
Maintaining standard indexes is expensive
Using standard indexes dramatically slows down data loading, especially with bigger data sources. This is true for most index implementations, including B-tree, bitmap and others.

Technically, indexes are typically divided into small blocks (e.g., every 8 KB). These blocks are updated in place as new data is inserted, to maintain the list of row IDs per value, so adding new rows to a table leads to many random writes across all indexes.
As a consequence, administrators typically index only a few columns in advance, and many ad hoc queries do not benefit from indexes. Also, adding a new index is done only after careful testing and long negotiation, to minimize its impact on the data loading process.
Standard indexes are not compatible with HDFS or Amazon S3
With standard indexes, adding rows to a table leads to many random updates to the index files. While standard (POSIX) file systems support random updates, modern big data storage solutions like HDFS and S3 do not. With HDFS, existing files cannot be updated; HDFS only supports appending new data at the end of an existing file (the MapR HDFS implementation is the only exception). Amazon S3 does not support random writes (or file appends) either.

Consequently, it is impossible to implement standard indexes on HDFS or Amazon S3. As an example of this challenge, when Pivotal migrated their Greenplum database to HDFS (and renamed it HAWQ), they removed support for indexes.
► Indexes don't accelerate all types of queries
Indexes do not provide a performance advantage for non-selective
queries, meaning queries that need to process a large percentage of
the data (as we will see later, JethroData has a different mechanism to
optimize these types of queries).
As a result, most SQL engines introduced in the last decade were
designed with an MPP architecture: distributing the data across many
nodes, where each node processes all of its local data for every query.
It is a simple architecture that handles large aggregate queries well,
but requires many servers to be effective, consumes many resources
per query across the cluster (limiting SQL concurrency) and does not
benefit from the dramatic performance advantage that indexes provide.
JethroData Indexes
JethroData indexes were specifically designed to solve the first two
problems. In other words, they were designed to work natively on
HDFS and Amazon S3 without the expensive index update overhead.
As index maintenance is relatively cheap in JethroData, it allows every
single column to be automatically indexed, thus accelerating many
more queries without requiring an administrator decision for every
column ("to index or not to index?").
Figure 1: JethroData Indexing Flow
How is that possible?
When new rows are loaded into a table, JethroData does not go back
to the existing index blocks and update them. Instead, it appends the
new entries to the end of the index, allowing duplicate index entries.
When new data is loaded, the JethroLoader utility generates a large
sequential write at the end of each index, instead of the conventional
method of updating many index entries in place. This leads to a
dramatically lower impact of each index on the load process, compared
to standard indexes.
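As a minimal, self-contained sketch (in Python, not JethroData's actual code or file format), the following illustrates the append-only idea: each load produces a new immutable index fragment, and a value may appear in several fragments:

# Minimal sketch of append-only indexing; all names are illustrative.
index_fragments = []  # list of {value: [row_ids]} dicts, one per load, oldest first

def load_rows(values, start_row_id):
    """Index one batch of column values as a single new fragment."""
    fragment = {}
    for offset, value in enumerate(values):
        fragment.setdefault(value, []).append(start_row_id + offset)
    index_fragments.append(fragment)  # one sequential append, no in-place update

def lookup(value):
    """Collect matching row IDs from every fragment (values repeat across loads)."""
    row_ids = []
    for fragment in index_fragments:
        row_ids.extend(fragment.get(value, []))
    return row_ids

load_rows(["US", "FR", "US"], start_row_id=0)
load_rows(["FR", "DE"], start_row_id=3)
print(lookup("FR"))  # [1, 3]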
During query execution, the JethroServer engine reads all the
relevant index fragments. For example, if a WHERE clause limits results
to a specific country (e.g., WHERE country = 'FR'), the engine may
read and process several smaller bitmap fragments, each covering a
different range of rows.
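As a hedged sketch of that read path, here each fragment is an integer bitmap over its own row range (bit i set means the row matches); this encoding is an assumption for illustration, not JethroData's on-disk format:

# Each fragment: (bitmap, first_row_id_of_its_range), e.g. for country = 'FR'.
fragments = [(0b0010, 0), (0b0101, 4)]  # covers rows 0-3 and rows 4-7

def rows_from_bitmap(bitmap, base):
    """Expand one bitmap fragment into absolute row IDs."""
    rows, bit = [], 0
    while bitmap >> bit:
        if (bitmap >> bit) & 1:
            rows.append(base + bit)
        bit += 1
    return rows

matching = []
for bitmap, base in fragments:
    matching.extend(rows_from_bitmap(bitmap, base))  # union of all fragments
print(matching)  # [1, 4, 6]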
At a later point, a background process (JethroMaint) periodically
merges duplicate index entries by replacing several smaller index files
with a single larger file and removing duplicate value entries. This is
done in a non-blocking fashion, without affecting running queries or
other data loads. It is similar in spirit to how Apache HBase performs
compactions to reduce the number of files and remove duplicates and
expired or deleted rows.
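Continuing the earlier Python sketch, a background merge might look like the following simplification (the real JethroMaint process works against index files on HDFS or S3):

# Replace many small fragments with one consolidated fragment.
def merge_fragments(fragments):
    merged = {}
    for fragment in fragments:
        for value, row_ids in fragment.items():
            merged.setdefault(value, []).extend(row_ids)  # combine duplicates
    return [merged]  # a single larger fragment replaces the small ones

index_fragments[:] = merge_fragments(index_fragments)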
In summary, JethroData solves the challenge of indexing at scale,
automatically indexes all columns, and enables users to enjoy the
performance of indexing without compromise.
Part 2: JethroData Architecture
The JethroData engine is installed on its own dedicated query cluster
of one or more nodes. The JethroData nodes are connected to a
scalable big data storage solution, either Hadoop HDFS or Amazon
S3. The JethroData query nodes themselves are stateless: all of their
indexes are stored in HDFS or S3. JethroData users typically work with
a BI tool, such as Qlik, Tableau or MicroStrategy, over ODBC/JDBC to
enable fast, interactive BI.
In this chapter we will first discuss JethroData's architecture in the
context of Hadoop and HDFS. We will then go over specific attributes
of JethroData for Amazon S3.
Figure 2: JethroData Architecture
JethroData for HDFS
When connected to Hadoop, JethroData is not installed directly on the
Hadoop cluster nodes. Instead, it is installed on its own set of
dedicated nodes next to the Hadoop cluster (sometimes called edge
nodes). It does, however, store all its data on HDFS, so it works as a
standard HDFS client.
This architecture has several benefits:
► Ease of deployment
Separating JethroData onto a few dedicated nodes means that no
change is required on the Hadoop cluster: no new component to
install and no configuration changes. All that is required is a dedicated
user with access to its own HDFS directory to store the indexes. This
means it is very easy and safe to try out or deploy JethroData.
► Query-optimized nodes
In many Hadoop clusters, the sizing of the physical nodes is optimized
either for batch processing or as a compromise between several
workloads, minimizing costs. As newer, memory-optimized
technologies emerge, modifying existing cluster hardware to
accommodate more RAM or a local SSD is a complicated task, in
terms of both effort and cost. With JethroData, you can easily deploy
one or more higher-powered nodes to support interactive BI users,
without the need to upgrade existing nodes in the cluster.
► Better co-existence with other workloads
Most Hadoop vendors envision Hadoop as a central computing
resource that is shared across many different workloads (sometimes
called an Enterprise Data Hub or Lake). However, most SQL-on-
Hadoop technologies can't easily share the cluster. They are designed
to consume a significant portion of memory, CPU and I/O bandwidth
from every node in the cluster at once, to provide a decent response
time ("MPP architecture"). This leads to an inherent contention over
resources between interactive BI workloads and other workloads in the
cluster. As more users are added, it becomes increasingly harder to
manage. With JethroData, the query processing is done on dedicated
JethroData nodes, so interactive BI users have a minimal impact on
the Hadoop cluster (just some HDFS accesses). This enables true
sharing of the cluster between interactive BI users and batch
workloads.
► Easy scale-out for concurrency
JethroData query nodes are stateless. As you add more concurrent
users, you can easily start additional JethroData nodes to support
them, and remove them when they are no longer needed. In other
words, it enables scaling compute (query servers) and storage (HDFS)
separately.
Also, with JethroData client-side load balancing, the ODBC/JDBC
driver automatically routes each subsequent query to a different
JethroData server in a round-robin fashion, evenly spreading the
workload from concurrent queries.
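As a hedged illustration of the idea (the node names and the routing function below are assumptions, not the driver's real API), client-side round-robin routing can be as simple as cycling through the server list:

import itertools

# Hypothetical server list; a real driver would read this from its connection string.
servers = itertools.cycle(["jethro-node1", "jethro-node2", "jethro-node3"])

def route_query(sql):
    node = next(servers)  # each new query goes to the next node in turn
    return node, sql      # a real driver would open a connection and execute here

print(route_query("SELECT 1")[0])  # jethro-node1
print(route_query("SELECT 1")[0])  # jethro-node2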
JethroData for Amazon S3
Amazon S3 is a popular, low-cost storage service on the Amazon Web
Services (AWS) cloud platform. It is a JethroData supported storage
solution, so if you work with Amazon S3 you don't need to have a
Hadoop cluster (and HDFS) as storage for JethroData.
When using Amazon S3, JethroData stores all its files (such as
indexes) in a user-specified bucket on Amazon S3. When users want
to query that data, they start one or more instances running JethroData
and can immediately start querying the data. When they are done, they
can just terminate the JethroData instances and stop paying for them.
No data copy phase
When starting a JethroData AWS instance, there is no need to copy all
of the JethroData files from Amazon S3 to that instance or perform
additional preparation. As the JethroData index is already stored on
Amazon S3, a newly-started JethroData instance can immediately start
serving queries. Let's quickly contrast this with two other common
analytics solutions on AWS.
► Amazon Redshift
Amazon Redshift is an AWS data warehouse service, based on a
parallel database implementation optimized for full scans. If your data
is stored on Amazon S3, you need to load it to Amazon Redshift each
time you start a Redshift cluster. It may not be a problem if your
Redshift cluster is up and running 24/7. However, if you want to run
some queries on an existing, large data set, this massive copy and
additional load step is costly in terms of both time and money.
In addition, while Redshift supports resizing your cluster, that
procedure is not designed to be invoked frequently, as it generates a
massive data copy that can take several hours and puts the cluster in
read-only mode. In contrast, JethroData servers are stateless and
compute-only (Amazon S3 is used for storage), so adding or
removing servers is nearly instantaneous.
► Amazon Elastic MapReduce (temporary Hadoop cluster)
Some users deploy a temporary Hadoop cluster for queries, either
taking advantage of the Amazon Elastic MapReduce (Amazon EMR)
service or deploying their own. However, as with Amazon Redshift, this
type of deployment requires a data copy step from Amazon S3 (where
data is permanently stored) to the local Hadoop file system (HDFS).
Again, that is costly in terms of both time and money.
Access Over Network
Amazon S3 is always accessed over the network. When you read or
write files to Amazon S3, the data is moved between your instance and
Amazon S3 over the network. In other words, with Amazon S3 there is
no data locality: as compute and storage are always separated, you
move the data to your code, by design. As such, it naturally fits the
JethroData architecture and core philosophy of surgically fetching only
the relevant data to a query host over the network.
Part 3: JethroData Performance Features
I/O-Related Features
JethroData I/O is performed over the network, whether to HDFS or to
Amazon S3. As such, there are numerous I/O optimizations to
minimize the required round-trips and bandwidth to the storage. The
main ones are:
► Compressed columnar format
When data is loaded into the JethroData indexed format, it is broken
into columns. Each column is stored separately, in a different set of
files. At query time, only columns that participate in the query need to
be accessed.
In addition, each column is encoded and compressed, minimizing its
storage footprint.
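The benefit is easy to see in a small hedged sketch (plain Python dictionaries stand in for the per-column files; encoding and compression are omitted):

# Each column is stored separately; a query reads only the columns it uses.
table = {
    "country": ["US", "FR", "US", "DE"],  # its own (compressed) file on HDFS/S3
    "revenue": [100, 250, 75, 300],       # its own file as well
    "comment": ["a", "b", "c", "d"],      # never touched by the query below
}

# SELECT sum(revenue) FROM table WHERE country = 'FR' reads two columns only:
total = sum(r for c, r in zip(table["country"], table["revenue"]) if c == "FR")
print(total)  # 250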
► Efficient skip-scan I/O
Assume that a query needs to fetch a million values from several
columns, after the WHERE clauses narrowed it down from billions of
rows. The JethroData engine will skip-scan each column: for
example, read a megabyte from the beginning, then skip ahead over a
12 MB region, read another 16 KB, and so on. Generally, it skip-scans
each column from the beginning to the end, merging read requests if
they are close by in the same file. This method automatically balances
I/O size between network bandwidth consumption and network round-
trips.
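A hedged sketch of the request-merging step (the offsets, lengths and the 1 MB gap threshold are assumptions for illustration):

# Coalesce (offset, length) read requests whose gaps are small enough.
def coalesce(requests, max_gap=1 << 20):  # merge reads separated by < 1 MB
    requests = sorted(requests)
    merged = [list(requests[0])]
    for offset, length in requests[1:]:
        end = merged[-1][0] + merged[-1][1]
        if offset - end <= max_gap:
            merged[-1][1] = offset + length - merged[-1][0]  # extend prior read
        else:
            merged.append([offset, length])  # too far away: separate round-trip
    return merged

print(coalesce([(0, 4096), (8192, 4096), (50 << 20, 4096)]))
# [[0, 12288], [52428800, 4096]] : the first two reads became one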
Alternatively, JethroData occasionally chooses to fetch an entire
column (at a table or partition level) instead of using an index, for
example, in the case of a range scan across many distinct values. In
those cases, JethroData uses block skipping whenever possible:
analyzing the min/max values of each column block and skipping an
entire block if it will not have any relevant rows. This is useful if the
data is physically clustered or sorted by that specific column.
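A hedged sketch of block skipping with per-block min/max metadata (the block sizes and values are invented for illustration):

# Skip whole blocks whose [min, max] range cannot contain matching rows.
blocks = [
    {"min": 1,   "max": 99,  "rows": [5, 42, 99]},
    {"min": 100, "max": 199, "rows": [150, 175]},
    {"min": 200, "max": 299, "rows": [201, 250]},
]

def range_scan(blocks, lo, hi):
    out = []
    for block in blocks:
        if block["max"] < lo or block["min"] > hi:
            continue  # skip the entire block without reading it
        out.extend(r for r in block["rows"] if lo <= r <= hi)
    return out

print(range_scan(blocks, 120, 210))  # [150, 175, 201]; the first block is skipped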
Finally, JethroData always issues multi-threaded (parallel) I/O requests
to efficiently use the bandwidth to the storage tier and to leverage
multiple disks in parallel.
► Local caching
While JethroData issues efficient remote I/Os, it is still better not to
repeatedly issue I/O requests to the same few popular columns,
indexes and metadata files. For that reason, JethroData automatically
caches frequently accessed column and index blocks locally, along
with small metadata files. This reduces the number of round-trips to
the HDFS NameNode and improves query performance with both
Hadoop and Amazon S3.
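A minimal hedged sketch of such a cache (a plain LRU keyed by file and block; the real cache would also handle sizing and invalidation):

from collections import OrderedDict

class BlockCache:
    """Least-recently-used cache for remote column/index blocks."""
    def __init__(self, capacity=1024):
        self.capacity, self.blocks = capacity, OrderedDict()

    def get(self, key, fetch_remote):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # refresh LRU position
            return self.blocks[key]       # served locally, no remote round-trip
        data = fetch_remote(key)          # one remote read on a cache miss
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used block
        return data

cache = BlockCache()
block = cache.get(("country.idx", 0), lambda key: b"...bytes from HDFS/S3...")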
Query Planning and Execution-Related Features
The JethroData engine includes many query planning and query
execution optimizations. Some of them are based on leveraging
indexes and some are based on understanding how BI tools generate
SQL on behalf of their users. Highlights include:
► Smart optimizer choices based on indexing
While many databases implement some sort of cost-based
optimization, they share a common challenge: reliance on collected
statistics. Their optimizer has to take into account that the statistics it
relies on might be inaccurate, stale or even missing, which complicates
its optimizations. It also requires additional administrative steps of
managing and tuning the statistics collection process and parameters.
In contrast, JethroData uses the index metadata instead of collected
statistics for optimization. The index metadata is guaranteed to always
be up to date, simplifying the optimization logic, avoiding optimization
mistakes due to stale statistics and removing the need to manage
statistics.
In addition, the JethroData engine uses index processing to make
better decisions. For example, when joining tables, it is in many cases
useful to change the join order so that the biggest result set is placed
on the left side of the join. Using its indexes, the JethroData engine
can in most cases know the exact size of the pre-join result set per
table, even after multiple WHERE conditions, and use that information
when deciding on join re-ordering.
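A hedged sketch of the reordering decision itself (the table names and counts are invented; the point is that the counts are exact, not estimates):

# Exact post-filter row counts per table, known from the indexes.
filtered_counts = {"sales": 9_200_000, "stores": 85, "dates": 31}

# Place the biggest result set on the left (first) side of the join.
join_order = sorted(filtered_counts, key=filtered_counts.get, reverse=True)
print(join_order)  # ['sales', 'stores', 'dates']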
► Answering queries from index metadata
Many simple, common queries can be easily answered directly by
processing the indexes, without the need to access the table's data.
Some examples:
Getting a list of values for a column (common with BI tools, for
generating a drop-down list):
SELECT DISTINCT col FROM table;
Or alternatively, counting the number of rows per value (common for
data exploration):
SELECT col, count(*) FROM table GROUP BY col
[HAVING...];
Counting the number of rows that match a complex condition:
SELECT count(*) FROM table
WHERE a=1 AND (b=2 OR d > 10) AND c IN (1,2,3);
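As a hedged sketch of how that last example can be answered from the indexes alone (Python integers stand in for the per-condition bitmaps; bit i set means row i matches):

# One bitmap per WHERE condition, combined with bitwise AND/OR.
a_eq_1   = 0b10110
b_eq_2   = 0b00110
d_gt_10  = 0b10000
c_in_123 = 0b10111

matches = a_eq_1 & (b_eq_2 | d_gt_10) & c_in_123
print(bin(matches).count("1"))  # 3 : the count, with no table data read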
► Parallel, pipelined execution
The JethroData execution engine is highly parallelized. The execution
plan is made of many fine-grained operators, and the engine
parallelizes the work within and across operators. In addition, rows
are pipelined between operators, and as the execution engine is
multi-threaded, pipelining does not require a memory copy, just
passing a pointer.
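As a hedged analogy (Python generators rather than threads, so it shows the pipelining but not the parallelism), rows can flow through operators without materializing intermediate results:

def scan(rows):
    yield from rows  # leaf operator: produce rows one at a time

def filter_op(rows, predicate):
    return (r for r in rows if predicate(r))  # consumes and yields lazily

def project(rows, column):
    return (r[column] for r in rows)

pipeline = project(filter_op(scan([{"c": 1}, {"c": 5}]), lambda r: r["c"] > 2), "c")
print(list(pipeline))  # [5] : no intermediate result set was built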
► Adaptive query and index cache
When working with a BI tool, it is very common to have many repeating
or mostly similar queries generated by dashboards and pre-built
reports. Also, many ad hoc explorations, while eventually diving deep
into a unique subset of the data, typically start by progressively adding
filters on top of the logical data model. The JethroData engine takes
advantage of this predictable BI user behavior pattern by automatically
caching result sets, both intermediate and near-final ones, and
transparently using them when applicable.
The adaptive cache is also incremental. For example, assume that
query results were cached and new rows are later inserted into the
fact table used by the query. If the query is sent again, in most cases
JethroData will still leverage the cached results. It will compute the
query results only on the delta and will merge them with the cached
results to produce the final query output. It will then replace the old
cached results with the new ones. This mechanism is transparent to
the user, who will never see stale or inconsistent data.
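A hedged sketch of that incremental merge for an additive aggregate (the cache layout is an assumption; real result sets live inside the engine):

# Cached result of: SELECT country, count(*) FROM fact GROUP BY country
cached = {"result": {"US": 2, "FR": 1}, "rows_covered": 3}

new_rows = ["FR", "DE"]  # rows loaded after the cache entry was built

delta = {}
for country in new_rows:  # aggregate only the delta
    delta[country] = delta.get(country, 0) + 1

merged = dict(cached["result"])
for country, count in delta.items():
    merged[country] = merged.get(country, 0) + count  # merge delta into cache

cached = {"result": merged, "rows_covered": cached["rows_covered"] + len(new_rows)}
print(cached["result"])  # {'US': 2, 'FR': 2, 'DE': 1}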
► Specific optimizations for BI-generated queries
The JethroData engine includes many optimizations that originated
from analyzing common patterns in SQL statements generated by
popular BI tools. These range from a simple, single-index execution
path (for example, getting a distinct list of column values for a
drop-down menu) to more complex optimizations, such as star
transformations and join eliminations.
Summary
We reviewed how JethroData's indexes are able to efficiently handle
big data by leveraging an append-only design. We then showed how
the JethroData architecture, which separates compute and storage,
works well with both Hadoop and Amazon S3 storage. Finally, we
covered some of JethroData's optimizations that enable interactive BI
response times for business users.
More Information
For the latest information about our product, please visit:
http://www.jethrodata.com
To download the product for a free trial, please visit:
http://www.jethrodata.com/download
► Contact Us
Email: info@jethrodata.com
Twitter: http://twitter.com/jethrodata
Copyright © 2015 JethroData Ltd. All Rights Reserved.
JethroData logos and trademarks are trademarks or registered trademarks of JethroData Ltd. or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others. Information regarding third-party products is provided solely for educational purposes.
JethroData is not responsible for the performance or support of third-party products and does not make any representations or warranties whatsoever regarding
the quality, reliability, functionality, or compatibility of these devices or products.