It is a TPC-H/DS benchmark of Hive LLAP (Low-Latency Analytical Processing) against Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in both performance and reliability.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
This document discusses how data locality is challenged in cloud computing environments where data is distributed across remote networks. It introduces LLAP (Live Long and Process), a caching technique used by Hortonworks Data Cloud that decentralizes data in columnar caches across nodes to improve query performance even when data is remote. The document explains how LLAP handles issues like distributed transactions and node failures to maintain cache consistency and affinity without losing performance. Overall, LLAP aims to overcome data locality issues in the cloud by leveraging efficient caching techniques.
The document discusses various techniques for optimizing data organization and performance in Hive, including:
- Partitioning data by meaningful columns like customer ID or VIN to improve lookup performance.
- Using the right number and size of buckets to avoid performance issues from too many small files or skewed data distribution.
- Denormalizing data and optimizing JOIN queries through techniques like broadcast joins.
- Storing data in its natural types like numbers instead of strings to enable predicate pushdown and better performance.
- Using temporary tables and in-memory storage to optimize queries involving data reorganization or distinct slices.
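As an illustration of the bucketing point above, bucket assignment in Hive is essentially a hash of the clustering column modulo the bucket count; the sketch below uses Python's built-in `hash` as a stand-in (not Hive's actual Java hash) to show how a hot key produces the skew the document warns about.

```python
def bucket_for(value, num_buckets: int) -> int:
    """Assign a row to a bucket by hashing its clustering column."""
    return hash(value) % num_buckets

def bucket_histogram(values, num_buckets: int):
    """Count rows per bucket to inspect skew."""
    counts = [0] * num_buckets
    for v in values:
        counts[bucket_for(v, num_buckets)] += 1
    return counts

# Well-distributed keys spread evenly across buckets...
even = bucket_histogram(range(1000), 8)
# ...but a single hot key sends every row to one bucket (skew).
skewed = bucket_histogram([42] * 1000, 8)
```

With too few buckets the files grow huge; with too many, each bucket becomes a tiny file — the histogram above is a quick way to sanity-check a candidate bucketing column.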
The document discusses large-scale stream processing in the Hadoop ecosystem. It provides examples of real-time stream processing use cases for computing player statistics and analyzing telco network data. It then summarizes several open source stream processing frameworks, including Apache Storm, Samza, Kafka Streams, Spark, Flink, and Apex. Key aspects like programming models, fault tolerance methods, and performance are compared for each framework. The document concludes with recommendations for further innovation in areas like dynamic scaling and batch integration.
The document discusses LLAP (Live Long and Process), a new execution layer for Hive that enables sub-second analytical queries. LLAP uses daemons running on worker nodes to cache data in memory and keep query fragments executing between queries for faster performance. It allows for highly concurrent queries without specialized YARN queues. Benchmarks show LLAP providing up to 90% faster performance over Hive for queries against large datasets. LLAP also aims to serve as a unified data access layer for other systems like Spark SQL.
Hive on Tez with LLAP (Live Long and Process) can achieve query processing rates of over 100,000 queries per hour. Reaching this level on a 45-node test cluster required tuning various Hive and YARN parameters, such as increasing the number of executor and I/O threads, adjusting memory allocation, and disabling consistent splits between LLAP daemons and data nodes. Future work includes adding a web UI for monitoring LLAP clusters and implementing column-level access controls, while allowing other frameworks like Spark to still access data through HiveServer2 and preventing direct access to HDFS for security reasons.
The document discusses Long-Lived Application Process (LLAP), a new capability in Apache Hive that enables long-lived daemon processes to improve query performance. LLAP eliminates Hive query startup costs by keeping query execution engines alive between queries. It allows queries to leverage just-in-time optimization and data caching to enable interactive query performance directly on HDFS data. LLAP utilizes asynchronous I/O, in-memory caching, and a query fragment API to optimize query processing. It integrates with Apache Tez to coordinate query execution across long-lived daemon processes and traditional YARN containers.
Major advancements in Apache Hive towards full support of SQL compliance include:
1) Adding support for SQL2011 keywords and reserved keywords to reduce parser ambiguity issues.
2) Adding support for primary keys and foreign keys to improve query optimization, specifically cardinality estimation for joins.
3) Implementing set operations like INTERSECT and EXCEPT by rewriting them using techniques like grouping, aggregation, and user-defined table functions.
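The third item — rewriting INTERSECT and EXCEPT via grouping and aggregation — can be sketched as follows. This is an illustrative reduction in Python, not Hive's actual rewrite: tag each row with the side it came from, group by row value, and keep the groups whose tags satisfy the set operation.

```python
from collections import defaultdict

def intersect(r, s):
    """R INTERSECT S (set semantics): values that appear on both sides."""
    sides = defaultdict(set)
    for row in r:
        sides[row].add("R")
    for row in s:
        sides[row].add("S")
    # Group by row value; keep groups tagged with both sources.
    return sorted(row for row, seen in sides.items() if seen == {"R", "S"})

def except_(r, s):
    """R EXCEPT S (set semantics): values in R that never appear in S."""
    return sorted(set(r) - set(s))

assert intersect([1, 2, 2, 3], [2, 3, 4]) == [2, 3]
assert except_([1, 2, 2, 3], [2]) == [1, 3]
```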
This document discusses key architectural considerations for Internet of Things (IoT) systems. It outlines three main tiers: origin, transport, and analytics. The origin tier includes sensors, devices, and gateways that generate IoT data. Common protocols at this tier are discussed. The transport tier orchestrates data flow and can perform transformations. Apache NiFi and minifi are presented as options. The analytics tier is where insights are derived from the data through streaming and batch processing. Apache Beam is highlighted as a framework that can unify both types of processing. The document also discusses firmware versions, parsers, schemas, and data ownership challenges.
We discuss the current state of LLAP (Live Long and Process), the engine for concurrent, sub-second execution of analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
The Apache Hive ACID project aims to make continuously adding and modifying data in Hive tables efficient and allow long-running queries to run concurrently with updates. It introduces transactional tables that support SQL insert, update, and delete operations. Data is stored in multiple versions to allow concurrent reads and writes. Updates are written to delta files and merged periodically with the base data to improve performance and self-tune storage over time.
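The base-plus-delta scheme described above can be sketched in miniature. This is an illustrative merge-on-read model, not Hive's actual ORC ACID file format: readers apply inserts, updates, and deletes from delta files on top of an immutable base snapshot, keyed by row id.

```python
def merge(base: dict, deltas: list) -> dict:
    """base: {row_id: row}; each delta entry: (op, row_id, row)
    where op is "insert", "update", or "delete"."""
    merged = dict(base)            # base files are never rewritten in place
    for op, row_id, row in deltas:
        if op == "delete":
            merged.pop(row_id, None)
        else:                      # insert and update both write the new version
            merged[row_id] = row
    return merged

base = {1: "a", 2: "b"}
deltas = [("update", 2, "b2"), ("delete", 1, None), ("insert", 3, "c")]
assert merge(base, deltas) == {2: "b2", 3: "c"}
```

Periodic compaction, as the summary notes, folds accumulated deltas back into a new base so that read-time merging stays cheap.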
The document discusses Live Long and Process (LLAP), a new capability in Apache Hive that enables sub-second query performance. LLAP achieves this through caching the hottest data in RAM on each Hadoop node and running queries against this cache via lightweight long-running daemon processes. It allows for 100% SQL compatibility while integrating with existing security and tools. LLAP provides benefits like failure tolerance, concurrency, ACID transactions, and elastic scaling. Performance tests on TPC-DS queries demonstrated sub-second latency for queries even at large data scales and high concurrency levels.
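The "hottest data in RAM" idea above can be illustrated with a simple LRU cache. This is a toy sketch only — LLAP's actual eviction policy is more sophisticated than plain LRU — but it shows the core mechanism: touching an entry keeps it hot, and the coldest entry is evicted when capacity is exceeded.

```python
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache; stand-in for a daemon's in-RAM data cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # cache miss: caller reads from storage
        self.data.move_to_end(key)           # mark as recently used (hot)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the coldest entry

cache = LRUCache(2)
cache.put("blk1", "colA")
cache.put("blk2", "colB")
cache.get("blk1")            # blk1 is now the hottest entry
cache.put("blk3", "colC")    # capacity exceeded: evicts blk2
```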
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
Apache Hive 2.0 provides major new features for SQL on Hadoop such as:
- HPLSQL which adds procedural SQL capabilities like loops and branches.
- LLAP which enables sub-second queries through persistent daemons and in-memory caching.
- Using HBase as the metastore which speeds up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark and the cost-based optimizer.
- Many bug fixes and under-the-hood improvements were also made while maintaining backwards compatibility where possible.
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA, improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: We use the reformed Flume to collect data as much as 200TB per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
The document discusses new features in Hive 2.0 including Hive LLAP (Live Long And Process) and Hive on ACID (Atomic, Consistent, Isolated, Durable). Hive LLAP introduces an in-memory caching mechanism that provides sub-second query performance for Hive. Hive on ACID allows for transactions on Hive tables including updates, deletes, and streaming ingestion while maintaining consistency and concurrency. The document provides overviews of how both features work and improvements they provide for analytics workloads on Hive.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
This document summarizes Richard Xu's presentation on tuning YARN, Hive, and queries on a Hadoop cluster. The initial issues with the cluster included jobs taking hours to finish when they were supposed to take minutes. Initial tuning focused on cluster configuration best practices and increasing YARN capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. Live Long and Process (LLAP) was presented as providing in-memory execution for faster queries.
This document summarizes a presentation about Apache Phoenix and HBase. It discusses the past, present, and future of SQL on HBase. In the past section, it describes Phoenix's architecture and key features like secondary indexes, joins, and aggregation. The present section highlights recent Phoenix releases including row timestamps, transactions using Tephra, and the new Phoenix Query Server. The future section mentions upcoming integrations with Calcite and Hive.
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings into light a new set of trade-offs and optimizations that allows for efficient and secure multi-user BI systems on the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The focus of the LLAP cache is to speed up common BI query patterns on the cloud while avoiding most of the operational administration overheads of maintaining a caching layer: the cache is automatically coherent, uses intelligent eviction, and supports custom file formats from text to ORC. We also explore the possibilities of combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We overview the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration on the cloud with new, improved concurrent query support and production-ready tools and UI.
Speaker
Sergey Shelukin, Member of Technical Staff, Hortonworks
Hive on Spark Is Blazing Fast... or Is It? (Hortonworks)
This presentation was given at the Strata + Hadoop World, 2015 in San Jose.
Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data.
In this talk the speakers examined Hive performance, past, present and future. In particular they looked at Hive’s origins as a petabyte scale SQL engine.
Through some numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, by vectorizing execution and by introducing a cost-based optimizer.
They detailed and discussed the challenges of scalable SQL on Hadoop.
They looked into Hive's sub-second future, powered by LLAP and Hive on Spark.
And showed just how fast Hive on Spark really is.
The Raft protocol has been successfully used for consistent metadata replication; however, using it for data replication poses unique challenges. Apache Ratis is a Raft implementation targeted at high-throughput data replication problems. Apache Ratis is being successfully used as the consensus protocol for data stored in Ozone (object store) and Quadra (block device), providing data throughput that saturates network links and disk bandwidth.
The pluggable nature of Ratis renders it useful for multiple use cases, including high availability, data or metadata replication, and ensuring consistency semantics.
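As a toy illustration of the consistency guarantee Raft provides (this is not Apache Ratis code): a log entry is committed once a majority of replicas have persisted it, so the committed index is the largest index present on a majority.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return cluster_size // 2 + 1

def committed_index(match_index: list) -> int:
    """match_index[i] = highest log index replicated on replica i.
    The committed index is the largest index held by a majority."""
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(len(match_index)) - 1]

# 5 replicas holding up to indices 7, 5, 5, 3, 2:
# index 5 is on three replicas (a majority), so it is committed.
assert majority(5) == 3
assert committed_index([7, 5, 5, 3, 2]) == 5
```

A leader re-evaluates this rule after each acknowledgment, which is why throughput depends on how quickly a quorum (not every replica) can persist entries.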
This talk presents the design challenges to achieve high throughput and how Apache Ratis addresses them. We talk about specific optimizations that have been implemented to minimize overheads and scale up the throughput while maintaining correctness of the consistency protocol. The talk also explains how systems like Ozone take advantage of Ratis's implementation choices to achieve scale. We will discuss the current performance numbers and also future optimizations.
Speakers
Mukul Kumar Singh, Staff Software Engineer, Hortonworks
Lokesh Jain, Software Engineer, Hortonworks
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations (Apache Apex)
Presenter:
Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover the use-case of ingesting data from Kafka and writing to HDFS with a couple of processing operators - Parser, Dedup, Transform.
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ... (DataWorks Summit)
Many organizations today have already migrated Hadoop workloads to cloud infrastructure, or they are actively planning such a migration. A common question in this scenario is "Which instance types should I use for my Hadoop cluster?" There are nuances to cloud infrastructure that require careful consideration when deciding which instance types to use. This session will show the results of a performance comparison of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance types commonly used in Hadoop clusters. More importantly, we will discuss the relative cost comparison of these instance types to demonstrate which AWS instances offer the best price-to-performance ratio using standard benchmarks. Attendees of this session will leave with a better understanding of the performance of AWS EC2 instance types when used for Hadoop workloads and be able to make more informed decisions about which instance types make the most sense for their needs.
Speakers
Michael Young, Senior Solutions Engineer, Hortonworks
Marcus Waineo, Principal Solutions Engineer, Hortonworks
Hive & HBase for Transaction Processing (Hadoop Summit EU, Apr 2015; alanfgates)
The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
This document discusses strategies for achieving sub-second SQL query performance on Hadoop at scale. It describes two use cases: highly parallel batch reporting on a massive dataset, and online reporting with low latency requirements. For the latter use case, the document evaluates Hive LLAP and Phoenix, finding that Phoenix generally has lower latency, especially for queries with large result sets, through optimizations like skip scans, merging improvements, and table splitting. Tuning HBase and Phoenix configurations can further reduce latency.
This document discusses Spotify's migration of data pipelines to Docker. It provides background on Spotify growing from 50 to 1000 engineers and the challenges of scaling their big data infrastructure. Spotify adopted Docker to help solve packaging and dependency issues, moving pipelines from cron jobs to a REST API and Docker images. Docker is allowing Spotify to transparently migrate their on-premise Hadoop cluster to Google Cloud, handling over 100 petabytes of data and growing.
SQL Server 2016: It Just Runs Faster (SQLBits 2017 edition, Bob Ward)
SQL Server 2016 includes several performance improvements that help it run faster than previous versions:
1. Automatic Soft NUMA partitions workloads across NUMA nodes when there are more than 8 CPUs per node to avoid bottlenecks.
2. Dynamic memory objects are now partitioned by CPU to avoid contention on global memory objects.
3. Redo operations can now be parallelized across multiple tasks to improve performance during database recovery.
This document discusses key architectural considerations for Internet of Things (IoT) systems. It outlines three main tiers: origin, transport, and analytics. The origin tier includes sensors, devices, and gateways that generate IoT data. Common protocols at this tier are discussed. The transport tier orchestrates data flow and can perform transformations. Apache NiFi and minifi are presented as options. The analytics tier is where insights are derived from the data through streaming and batch processing. Apache Beam is highlighted as a framework that can unify both types of processing. The document also discusses firmware versions, parsers, schemas, and data ownership challenges.
We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
The Apache Hive ACID project aims to make continuously adding and modifying data in Hive tables efficient and allow long-running queries to run concurrently with updates. It introduces transactional tables that support SQL insert, update, and delete operations. Data is stored in multiple versions to allow concurrent reads and writes. Updates are written to delta files and merged periodically with the base data to improve performance and self-tune storage over time.
The document discusses Live Long and Process (LLAP), a new capability in Apache Hive that enables sub-second query performance. LLAP achieves this through caching the hottest data in RAM on each Hadoop node and running queries against this cache via lightweight long-running daemon processes. It allows for 100% SQL compatibility while integrating with existing security and tools. LLAP provides benefits like failure tolerance, concurrency, ACID transactions, and elastic scaling. Performance tests on TPC-DS queries demonstrated sub-second latency for queries even at large data scales and high concurrency levels.
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
Apache Hive 2.0 provides major new features for SQL on Hadoop such as:
- HPLSQL which adds procedural SQL capabilities like loops and branches.
- LLAP which enables sub-second queries through persistent daemons and in-memory caching.
- Using HBase as the metastore which speeds up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark and the cost-based optimizer.
- Many bug fixes and under-the-hood improvements were also made while maintaining backwards compatibility where possible.
Practice of large Hadoop cluster in China MobileDataWorks Summit
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA , improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: We use the reformed Flume to collect data as much as 200TB per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
The document discusses new features in Hive 2.0 including Hive LLAP (Live Long And Process) and Hive on ACID (Atomic, Consistent, Isolated, Durable). Hive LLAP introduces an in-memory caching mechanism that provides sub-second query performance for Hive. Hive on ACID allows for transactions on Hive tables including updates, deletes, and streaming ingestion while maintaining consistency and concurrency. The document provides overviews of how both features work and improvements they provide for analytics workloads on Hive.
The document discusses evolving HDFS to support generalized storage containers in order to better scale the number of files and blocks. It proposes using block containers and a partial namespace approach to initially scale to billions of files and blocks, and eventually much higher numbers. The storage layer is being restructured to support various container types for use cases beyond HDFS like object storage and HBase.
This document summarizes Richard Xu's presentation on tuning YARN, Hive, and queries on a Hadoop cluster. The initial issues with the cluster included jobs taking hours to finish when they were supposed to take minutes. Initial tuning focused on cluster configuration best practices and increasing YARN capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. Live Long and Process (LLAP) was also presented.
This document summarizes a presentation about Apache Phoenix and HBase. It discusses the past, present, and future of SQL on HBase. In the past section, it describes Phoenix's architecture and key features like secondary indexes, joins, and aggregation. The present section highlights recent Phoenix releases including row timestamps, transactions using Tephra, and the new Phoenix Query Server. The future section mentions upcoming integrations with Calcite and Hive.
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings to light a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns in the cloud while avoiding most of the operational overhead of administering a caching layer: the cache is automatically coherent, uses intelligent eviction, and supports custom file formats from text to ORC. We also explore combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We give an overview of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration in the cloud with new, improved concurrent query support and production-ready tools and UI.
Speaker
Sergey Shelukin, Member of Technical Staff, Hortonworks
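The abstract above highlights an automatically coherent cache with intelligent eviction. As a rough illustration of that idea (not LLAP's actual implementation — LLAP uses an LRFU eviction policy over off-heap buffers, and the class and key layout below are invented for the sketch), here is a toy columnar cache that evicts the least-recently-used chunk and invalidates cached chunks of a rewritten file so readers never see stale data:

```python
from collections import OrderedDict

class ColumnarCache:
    """Toy coherent cache keyed by (file, column, stripe) with LRU eviction.
    Only illustrates the coherence-plus-eviction idea from the abstract."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (file, column, stripe) -> column chunk

    def get(self, key, loader):
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh recency on a hit
            return self.entries[key]
        value = loader(key)                    # miss: read from remote storage
        self.entries[key] = value
        if len(self.entries) > self.capacity:  # evict least-recently-used chunk
            self.entries.popitem(last=False)
        return value

    def invalidate_file(self, path):
        """Drop all cached chunks of a rewritten file to stay coherent."""
        for key in [k for k in self.entries if k[0] == path]:
            del self.entries[key]
```

The invalidation hook is what makes the cache "automatically coherent" in miniature: an UPDATE or DELETE that rewrites a file simply drops its chunks, and the next read repopulates them.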
Hive on Spark is blazing fast, or is it? (final) - Hortonworks
This presentation was given at the Strata + Hadoop World, 2015 in San Jose.
Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data.
In this talk the speakers examined Hive performance, past, present and future. In particular they looked at Hive’s origins as a petabyte scale SQL engine.
Through some numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, by vectorizing execution and by introducing a cost-based optimizer.
They detailed and discussed the challenges of scalable SQL on Hadoop.
They looked into Hive's sub-second future, powered by LLAP and Hive on Spark.
And showed just how fast Hive on Spark really is.
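The talk above credits much of Hive's 100x speedup to vectorized execution: operators process fixed-size batches of column values per call instead of interpreting one row at a time. A minimal sketch of that batching pattern (the function and data are hypothetical; Hive's real VectorizedRowBatch carries 1024 rows of typed column vectors plus a selection vector):

```python
BATCH = 1024  # Hive's vectorized operators also use 1024-row batches

def vectorized_sum_over_filter(prices, threshold):
    """Filter-then-sum over fixed-size batches, amortizing per-row
    dispatch cost the way a vectorized query engine does."""
    total = 0.0
    for start in range(0, len(prices), BATCH):
        batch = prices[start:start + BATCH]
        # selection vector: indices of rows that survive the predicate
        selected = [i for i, p in enumerate(batch) if p > threshold]
        total += sum(batch[i] for i in selected)
    return total
```

In an interpreted engine the win comes from touching each operator once per batch rather than once per row; in compiled engines the same shape also unlocks SIMD.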
The Raft protocol has been used successfully for consistent metadata replication; however, using it for data replication poses unique challenges. Apache Ratis is a Raft implementation targeted at high-throughput data replication problems. Apache Ratis is successfully used as the consensus protocol for data stored in Ozone (object store) and Quadra (block device), providing data throughput that saturates network links and disk bandwidth.
Pluggable nature of Ratis renders it useful for multiple use cases including high availability, data or metadata replication, and ensuring consistency semantics.
This talk presents the design challenges to achieve high throughput and how Apache Ratis addresses them. We talk about specific optimizations that have been implemented to minimize overheads and scale up the throughput while maintaining correctness of the consistency protocol. The talk also explains how systems like Ozone take advantage of Ratis’s implementation choices to achieve scale. We will discuss the current performance numbers and also future optimizations. MUKUL KUMAR SINGH, Staff Software Engineer, Hortonworks and LOKESH JAIN, Software Engineer, Hortonworks
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations - Apache Apex
Presenter:
Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover the use-case of ingesting data from Kafka and writing to HDFS with a couple of processing operators - Parser, Dedup, Transform.
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ... - DataWorks Summit
Many organizations today have already migrated Hadoop workloads to cloud infrastructure, or they are actively planning such a migration. A common question in this scenario is "Which instance types should I use for my Hadoop cluster?" There are nuances to cloud infrastructure that require careful consideration when deciding which instance types to use. This session will show the results of a performance comparison of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance types commonly used in Hadoop clusters. More importantly, we will discuss the relative cost comparison of these instance types to demonstrate which AWS instances offer the best price-to-performance ratio using standard benchmarks. Attendees of this session will leave with a better understanding of the performance of AWS EC2 instance types when used for Hadoop workloads and be able to make more informed decisions about which instance types make the most sense for their needs.
Speakers
Michael Young, Senior Solutions Engineer, Hortonworks
Marcus Waineo, Principal Solutions Engineer, Hortonworks
Hive & HBase for Transaction Processing, Hadoop Summit EU Apr 2015 - alanfgates
The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
This document discusses strategies for achieving sub-second SQL query performance on Hadoop at scale. It describes two use cases: highly parallel batch reporting on a massive dataset, and online reporting with low latency requirements. For the latter use case, the document evaluates Hive LLAP and Phoenix, finding that Phoenix generally has lower latency, especially for queries with large result sets, through optimizations like skip scans, merging improvements, and table splitting. Tuning HBase and Phoenix configurations can further reduce latency.
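The summary above mentions Phoenix's skip scan as a key latency optimization. The idea is that for an IN-list predicate on the leading key column, the scanner seeks directly to each matching key range instead of reading every row between the first and last match. A minimal sketch over an in-memory sorted key list (the function is hypothetical; Phoenix's real skip scan operates via HBase scanner seek hints):

```python
from bisect import bisect_left

def skip_scan(sorted_keys, leading_values):
    """For each wanted leading-key value, seek (binary search) to its first
    occurrence and read contiguously, instead of scanning the whole range.
    sorted_keys is a sorted list of composite-key tuples."""
    results, seeks = [], 0
    for v in sorted(leading_values):
        i = bisect_left(sorted_keys, (v,))  # one 'seek' per wanted value
        seeks += 1
        while i < len(sorted_keys) and sorted_keys[i][0] == v:
            results.append(sorted_keys[i])  # contiguous read of the range
            i += 1
    return results, seeks
```

With k wanted values the scan does k seeks plus only the matching rows, rather than touching every row in the enclosing key range — the source of the latency win for selective queries.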
This document discusses Spotify's migration of data pipelines to Docker. It provides background on Spotify growing from 50 to 1000 engineers and the challenges of scaling their big data infrastructure. Spotify adopted Docker to help solve packaging and dependency issues, moving pipelines from cron jobs to a REST API and Docker images. Docker is allowing Spotify to transparently migrate their on-premise Hadoop cluster to Google Cloud, handling over 100 petabytes of data and growing.
SQL Server 2016 "It Just Runs Faster", SQLBits 2017 edition - Bob Ward
SQL Server 2016 includes several performance improvements that help it run faster than previous versions:
1. Automatic Soft NUMA partitions workloads across NUMA nodes when there are more than 8 CPUs per node to avoid bottlenecks.
2. Dynamic memory objects are now partitioned by CPU to avoid contention on global memory objects.
3. Redo operations can now be parallelized across multiple tasks to improve performance during database recovery.
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J... - Cloudera, Inc.
Cisco's Unified Fabric provides an integrated networking solution optimized for big data infrastructures using Hadoop. The document describes Cisco's testing of the Unified Fabric using a Hadoop cluster of 128 and 16 nodes running Yahoo's Terasort benchmark on 1TB of data. It found that the Unified Fabric can support the network traffic patterns of Hadoop workloads while efficiently utilizing buffering to absorb bursts of traffic during shuffle and replication phases.
Based on the popular blog series, join me in taking a deep dive and a behind the scenes look at how SQL Server 2016 “It Just Runs Faster”, focused on scalability and performance enhancements. This talk will discuss the improvements, not only for awareness, but expose design and internal change details. The beauty behind ‘It Just Runs Faster’ is your ability to just upgrade, in place, and take advantage without lengthy and costly application or infrastructure changes. If you are looking at why SQL Server 2016 makes sense for your business you won’t want to miss this session.
The state of SQL-on-Hadoop in the Cloud - Nicolas Poggi
With the increase of Hadoop offerings in the cloud, users face many decisions: which cloud provider, which VMs, cluster sizing, storage type, or even whether to go to fully managed Platform-as-a-Service (PaaS) Hadoop. As the answer always depends on your data and usage, this talk guides participants through an overview of the PaaS solutions from the leading cloud providers, highlighting the main results of benchmarking their SQL-on-Hadoop (i.e., Hive) services with the ALOJA benchmarking project. It compares their current offerings in terms of readiness, architectural differences, and cost-effectiveness (performance-to-price) for entry-level Hadoop deployments, and briefly shows how to replicate the results and create custom benchmarks from internal apps, so that users can make their own decisions about choosing the right provider for their particular data needs.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
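Price-performance comparisons like the one above usually reduce to cost per benchmark run: hourly price times cluster size times runtime. A small sketch of that ranking calculation (the function name and the figures in the test are invented for illustration, not results from the study):

```python
def price_performance(runs):
    """Rank cluster configurations by cost per benchmark run:
    cost = price per node-hour * number of nodes * runtime in hours.
    runs maps a config name to (price_per_node_hour, nodes, runtime_h)."""
    ranked = []
    for name, (price_per_node_hour, nodes, runtime_h) in runs.items():
        cost = price_per_node_hour * nodes * runtime_h
        ranked.append((cost, name))
    return [name for cost, name in sorted(ranked)]  # cheapest run first
```

This is why a pricier instance can still win the price-performance comparison: if it finishes the benchmark proportionally faster, its cost per run is lower.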
[ACNA2022] Hadoop Vectored IO: your data just got faster! - MukundThakur22
Since 2006 the world of big data has moved from terabytes to hundreds of petabytes, from local clusters to remote cloud storage, yet the original Apache Hadoop POSIX-based file APIs have barely changed.
It is wonderful that these APIs have worked so well, but we can do a lot better with remote object stores, by providing new operations which suit them better, targeted at columnar data libraries such as ORC and Spark. Only a few libraries need to migrate to these APIs for significant speedups of all big data applications.
This talk introduces a new Hadoop Filesystem API called "vectored read", coming in Hadoop 3.4. An extension of the classic FSDataInputStream, it is automatically offered by all filesystem clients.
The S3A connector is the first object store to provide a custom implementation, reading different blocks of data in parallel. In Apache Hive benchmarks with a modified ORC library, we saw a 2x speedup compared to using the classic S3A connector through the POSIX-style APIs.
We will introduce the API spec, the S3A implementation, and the benchmarks, and show how to use it in your own applications. We will also cover our ongoing work on providing similar speedups with other object stores, and the use of the API in other applications.
Presto was used to analyze logs collected in a Hadoop cluster. It provided faster query performance compared to Hive+Tez, with results returning in seconds rather than hours. Presto was deployed across worker nodes and performed better than Hive+Tez for different query and data formats. With repeated queries, Presto's performance improved further due to caching, while Hive+Tez showed no change. Overall, Presto demonstrated itself to be a faster solution for interactive queries on large log data.
The state of Hive and Spark in the Cloud (July 2017) - Nicolas Poggi
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDinsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
Using BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
Performance Optimizations in Apache Impala - Cloudera, Inc.
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or Spark. Impala is written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
This document compares the performance of six supercomputers with over 1,000 processors each on various synthetic benchmarks and applications. The supercomputers have different node sizes, processor counts, and interconnect technologies. Performance is analyzed using a model that breaks down run time into computation, communication, and I/O components. Results show that different systems perform best for different benchmarks and applications, depending on factors like the communication requirements and how well the application scales. The Blue Gene supercomputer shows strong scaling and I/O performance but has limitations in processor speed and memory size per node.
We tuned the MapReduce cluster benchmark TeraSort with a derivative-free optimization (DFO) method that treats measured runtime as the objective function. Each DFO iteration evaluates a new set of Hadoop configuration parameter values. The parameters are specified within the framework; we used the Chef server and client tools to apply each cluster configuration and ensure a proper run of the TeraSort application.
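The tuning loop described above can be sketched as a simple derivative-free coordinate search: perturb one parameter, measure the runtime, and keep the change only if it improves. The function below is a generic illustration, not the cited work's method; the parameter names and the synthetic runtime model in the test are invented, and in practice `runtime` would launch TeraSort on a Chef-configured cluster and return the wall-clock time:

```python
import random

def dfo_tune(runtime, space, iters=50, seed=7):
    """Derivative-free search over a discrete parameter space.
    runtime: config dict -> measured cost (stands in for a TeraSort run).
    space: parameter name -> list of candidate values."""
    rng = random.Random(seed)
    config = {name: values[0] for name, values in space.items()}
    best = runtime(config)                 # measure the starting configuration
    for _ in range(iters):
        name = rng.choice(list(space))     # perturb one parameter at a time
        trial = dict(config)
        trial[name] = rng.choice(space[name])
        cost = runtime(trial)
        if cost < best:                    # keep only improving moves
            config, best = trial, cost
    return config, best
```

Because each evaluation is a full benchmark run, the method's value is in needing relatively few iterations rather than in per-iteration cleverness.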
UKOUG version of a presentation trying to establish the sensible limits of parallelism on a couple of hardware configurations. Detailed white paper is at http://oracledoug.com/px_slaves.pdf
PostgreSQL 10 will include several new features and improvements, including logical replication which allows replicating specific tables, quorum-based synchronous replication, improved partitioning support, pushing more computations to foreign databases with FDWs, expanded parallel query capabilities, techniques to reduce write amplification, indirect indexes, executor and statistics overhauls, and easier backup/replication configuration defaults. Many of these features are still in development.
At Twitter we started out with a large monolithic cluster that served most of the use-cases. As the usage expanded and the cluster grew accordingly, we realized we needed to split the cluster by access pattern. This allows us to tune the access policy, SLA, and configuration for each cluster. We will explain our various use-cases, their performance requirements, and operational considerations and how those are served by the corresponding clusters. We will discuss what our baseline Hadoop node looks like. Various, sometimes competing, considerations such as storage size, disk IO, CPU throughput, fewer fast cores versus many slower cores, 1GE bonded network interfaces versus a single 10 GE card, 1T, 2T or 3T disk drives, and power draw all need to be considered in a trade-off where cost and performance are major factors. We will show how we have arrived at quite different hardware platforms at Twitter, not only saving money, but also increasing performance.
Deep Learning with Apache Spark and GPUs with Pierce Spitler - Databricks
Apache Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This session will cover:
– How to leverage Spark and TensorFlow for hyperparameter tuning and for deploying trained models
– DeepLearning4J, CaffeOnSpark, IBM’s SystemML and Intel’s BigDL
– Sidecar GPU cluster architecture and Spark-GPU data reading patterns
– The pros, cons and performance characteristics of various approaches
You’ll leave the session better informed about the available architectures for Spark and deep learning, and Spark with and without GPUs for deep learning. You’ll also learn about the pros and cons of deep learning software frameworks for various use cases, and discover a practical, applied methodology and technical examples for tackling big data deep learning.
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi... - Databricks
In this talk, we will present how we analyze, predict, and visualize network quality data, as a spark AI use case in a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, the total size of which is 60TB, 120 billion records per day.
To address earlier problems with Spark on HDFS, we developed a new data store for SparkSQL, consisting of Redis and RocksDB, that allows us to distribute and store these data in real time and analyze them right away. Not satisfied with analyzing network quality only in real time, we also tried to predict network quality in the near future in order to quickly detect and recover from network device failures, by designing a network signal pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL to SparkSQL and our new store, we have built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on the map in real time.
Linux Kernel vs DPDK: HTTP Performance Showdown - ScyllaDB
In this session I will use a simple HTTP benchmark to compare the performance of the Linux kernel networking stack with userspace networking powered by DPDK (kernel-bypass).
It is said that kernel-bypass technologies avoid the kernel because it is "slow", but in reality, a lot of the performance advantages that they bring just come from enforcing certain constraints.
As it turns out, many of these constraints can be enforced without bypassing the kernel. If the system is tuned just right, one can achieve performance that approaches kernel-bypass speeds, while still benefiting from the kernel's battle-tested compatibility, and rich ecosystem of tools.
This document summarizes the results of a benchmark test comparing the performance of GeoServer and MapServer web map server (WMS) implementations against different data backends and workloads. Key findings include that GeoServer was generally faster than MapServer at reading shapefiles and rendering plain polygons. Performance was similar between the two when using PostGIS and Oracle spatial backends. MapServer showed improved performance for labelled roads rendering compared to previous tests. Areas for potential improvement in future tests are also discussed.
Similar to A TPC Benchmark of Hive LLAP and Comparison with Presto
Cloud Era Transactional Processing -- Problems, Strategies and Solutions - Yu Liu
The document discusses challenges and solutions for transactional processing in the cloud era. It covers modeling transactional consistency constraints, choosing appropriate consistency models like causal consistency, and state-of-the-art academic research in coordination avoidance, consistency models, and hardware efforts to improve transaction processing performance. The document provides definitions of consistency models and isolation levels and compares different approaches.
The document discusses natural language processing (NLP) for medical documents, specifically retrieving International Classification of Diseases (ICD) codes from free-text medical reports. It summarizes a medical NLP shared task called MedNLPDoc that aimed to retrieve information from Japanese medical reports. The highest performing system used a rule-based approach, showing rules can still outperform machine learning for medical NLP. Collaboration between researchers and enterprises was encouraged to resolve gaps between academic research and real-world requirements.
Survey on Parallel/Distributed Search Engines - Yu Liu
This document summarizes a survey on parallel and distributed search engines. It discusses how web search tasks like crawling billions of documents, indexing terabytes of data, and responding to thousands of queries simultaneously require a parallel or distributed approach. It then provides examples of distributed search engines and technologies like MapReduce, and discusses challenges in distributed search like resource representation, selection, and result merging. Finally, it surveys parallel implementations of clustering algorithms and challenges in parallelizing hierarchical agglomerative clustering with MapReduce.
Paper introduction to Combinatorial Optimization on Graphs of Bounded Treewidth - Yu Liu
This slides introduced the paper: H. L. Bodlaender and a. M. C. a. Koster, “Combinatorial Optimization on Graphs of Bounded Treewidth,” Comput. J., vol. 51, no. 3, pp. 255–269, Nov. 2007.
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection - Yu Liu
The paper Combinatorial Model and Bounds for Target Set Selection by Eyal Ackerman, Oren Ben-Zwi, and Guy Wolfovitz presents:
1. a combinatorial model for the dynamic activation process of influential networks;
2. a representation of the Perfect Target Set Selection Problem and its variants as linear integer programs;
3. combinatorial lower and upper bounds on the size of the minimum Perfect Target Set.
An accumulative computation framework on MapReduce (PPL 2013) - Yu Liu
The document discusses an accumulative computation framework on MapReduce clusters. It presents examples of accumulative computation programs and benchmarks their performance on MapReduce. The experiments show the framework can process large datasets in a reasonable time and achieves near-linear speedup when increasing CPUs, demonstrating the efficiency and scalability of the approach. The accumulative computation pattern and framework simplify parallelizing problems that have data dependencies and allow encoding many parallel computations.
A Homomorphism-based Framework for Systematic Parallel Programming with MapRe... - Yu Liu
This document describes a homomorphism-based framework for systematic parallel programming with MapReduce. The framework introduces a systematic approach to automatically generate fully parallelized and scalable MapReduce programs. It provides algorithmic programming interfaces that allow users to focus on the algebraic properties of problems, hiding the details of MapReduce. The framework was implemented on top of Hadoop and evaluated on several test problems, demonstrating good scalability and parallelism. Future work could decrease system overhead, optimize performance further, and extend the framework to more complex data structures like trees and graphs.
An Introduction of Recent Research on MapReduce (2011) - Yu Liu
This document summarizes recent research on MapReduce. It outlines papers presented at the MAPREDUCE11 conference and Hadoop World 2010, including papers on resource attribution in data clusters, shared-memory MapReduce implementations, static type checking of MapReduce programs, QR factorizations, genome indexing, and optimizing data selection. It also summarizes talks and lists several interesting papers on topics like distributed data processing.
Introduction of A Lightweight Stage-Programming Framework - Yu Liu
The Lightweight Stage-Programming Framework introduced in these slides can be used to build efficient parallel DSLs that can be transformed into MapReduce programs. To understand these slides, please first read http://www.slideshare.net/YuLiu19/a-generatetestaggregate-parallel-programming-library-on-spark.
Start From A MapReduce Graph Pattern-recognize Algorithm - Yu Liu
This document summarizes a presentation on developing a MapReduce algorithm to recognize patterns in large graphs by finding connected components. It discusses:
- Motivation to study parallel graph algorithms and frameworks like MapReduce and Pregel
- The problem of finding link patterns in graphs by extracting connected components
- Background on semantic web and linked open data modeled as RDF graphs
- A naive O(2Ck)-iteration MapReduce algorithm to find connected components between pairs of datasets
- Examples and analysis of the algorithm's complexity and communication costs
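Connected components can be found with iterative MapReduce rounds of minimum-label propagation: each node starts labeled with its own id, every edge emits the smaller endpoint label to both endpoints, and the reduce step keeps the minimum per node until nothing changes. This is a simplified generic variant for illustration, not the specific algorithm from the slides (whose iteration bound and dataset-pair setup differ):

```python
def min_label_round(edges, labels):
    """One MapReduce-style round: map over edges emitting the smaller
    endpoint label to both endpoints; reduce keeps the minimum per node."""
    emitted = {}
    for u, v in edges:
        low = min(labels[u], labels[v])
        for node in (u, v):
            emitted[node] = min(emitted.get(node, labels[node]), low)
    changed = any(emitted[n] != labels[n] for n in emitted)
    labels.update(emitted)
    return changed

def connected_components(nodes, edges):
    labels = {n: n for n in nodes}      # each node starts as its own label
    while min_label_round(edges, labels):
        pass                            # iterate until labels stabilize
    return labels                       # nodes sharing a label share a component
```

Each round is a map (over edges) plus a reduce (min per node), and the number of rounds is bounded by the graph diameter, which is what makes the iteration count the dominant cost on MapReduce.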
Introduction of the Design of A High-level Language over MapReduce -- The Pig... - Yu Liu
Pig is a platform for analyzing large datasets that uses Pig Latin, a high-level language, to express data analysis programs. Pig Latin programs are compiled into MapReduce jobs and executed on Hadoop. Pig Latin provides data manipulation constructs like SQL as well as user-defined functions. The Pig system compiles programs through optimization, code generation, and execution on Hadoop. Future work focuses on additional optimizations, non-Java UDFs, and interfaces like SQL.
On Extending MapReduce - Survey and Experiments - Yu Liu
It presents a survey and my experiments on extending the MapReduce programming model. A BSP-based MapReduce interface was implemented and evaluated, showing dramatic performance improvements.
Introduction to Ultra-succinct representation of ordered trees with applications - Yu Liu
The document summarizes a paper on ultra-succinct representations of ordered trees. It introduces tree degree entropy, a new measure of information in trees. It presents a succinct data structure that uses nH*(T) + O(n log log n / log n) bits to represent an ordered tree T with n nodes, where H*(T) is the tree degree entropy. This representation supports computing consecutive bits of the tree's DFUDS representation in constant time. It also supports computing operations like lowest common ancestor, depth, and level-ancestor in constant time using an auxiliary structure of O(n (log log n)^2 / log n) bits.
On Implementation of Neuron Network (Back-propagation) - Yu Liu
This document outlines Yu Liu's work implementing and comparing different parallel versions of a neural network using backpropagation. It discusses motivations for parallel programming practice and library study. It provides an introduction to neural networks and backpropagation algorithms. Three implementations are compared: sequential C++ STL, a skeleton library, and Intel TBB. Benchmark results show improved speedups from the parallel versions. Remaining challenges are also noted, such as addressing local minima problems and testing on larger data.
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on Hadoop - Yu Liu
This document describes Yu Liu's ScrewDriver Rebirth framework for implementing the generate-test-aggregate algorithm on Hadoop. The framework uses semiring structures to represent the generate, test, and aggregate functions. It defines Generator and Aggregater classes to implement generation and aggregation. The framework allows fusing operations by lifting semirings and defining new generators. Examples show various generators, tests, and aggregators run on Hadoop to evaluate performance improvements over the previous version.
A Homomorphism-based MapReduce Framework for Systematic Parallel Programming - Yu Liu
The document outlines a homomorphism-based framework for parallel programming on MapReduce. It introduces homomorphisms and theorems about them. The framework represents lists as sets of key-value pairs distributed across nodes. Functions are implemented using this representation and MapReduce, allowing easy parallelization of problems like maximum prefix sum that are otherwise complex on MapReduce.
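The maximum-prefix-sum example mentioned above is the classic case: it looks sequential, but as a list homomorphism it parallelizes cleanly, because each segment can be summarized independently and the summaries merge with an associative combiner. A sketch of that decomposition (illustrative function names; the framework itself generates the MapReduce plumbing around exactly this hom/combine pair):

```python
def mps_hom(segment):
    """Map phase: summarize one list segment as a pair
    (best prefix sum within the segment, total sum of the segment).
    The empty prefix counts, so the best is never below 0."""
    best, total = 0, 0
    for x in segment:
        total += x
        best = max(best, total)
    return best, total

def mps_combine(left, right):
    """Reduce phase: associative combiner for two adjacent summaries.
    A prefix of the combined list either ends in the left segment,
    or spans all of left plus a prefix of right."""
    (lbest, lsum), (rbest, rsum) = left, right
    return max(lbest, lsum + rbest), lsum + rsum

def maximum_prefix_sum(segments):
    summary = (0, 0)
    for seg in segments:                 # order-preserving fold over segments
        summary = mps_combine(summary, mps_hom(seg))
    return summary[0]
```

Because `mps_combine` is associative, the per-segment summaries can be merged in any grouping, so the fold shown here could equally run as a parallel reduction tree over MapReduce workers.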
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... - Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performance - roli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
3. Conclusions for Impatience
❏ Hive LLAP brings significant performance improvements
❏ On TPC-DS, Hive LLAP outperforms both non-LLAP Hive and Presto
❏ Presto shows some advantages on TPC-H
❏ Hive LLAP has a larger RAM footprint and needs careful tuning
4. Environment Setting
Cluster configurations (AWS EC2):
❏ 1 X Master : r4.xlarge (4 vCPU - 32GB RAM)
❏ 10 X Workers : i3.large (2 vCPU - 16GB RAM)
Stacks:
❏ HDP 2.6 (Hive 2.1.0, Tez 0.7.0, Calcite 1.2.0)
❏ Presto 0.208
TPC Data Sets
❏ 10 GB in text/ORC format for both TPC-DS and TPC-H
5. Results and Comparisons
Query performance in this table includes both “query-duration” and “job-submission” time.
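This end-to-end timing can be sketched as a small driver that wall-clocks each query from submission to completion. The `run_query` helper and the engine CLI invocations are hypothetical placeholders, since the slides do not show the actual harness:

```python
import subprocess
import sys
import time

def run_query(engine_cmd, sql):
    """Wall-clock a single query end to end, so the measurement covers
    job-submission overhead as well as query duration."""
    start = time.monotonic()
    subprocess.run(engine_cmd + [sql], check=True, capture_output=True)
    return time.monotonic() - start

# Illustrative invocations (not the benchmark's real harness):
#   run_query(["hive", "-e"], "SELECT count(*) FROM store_sales")
#   run_query(["presto", "--execute"], "SELECT count(*) FROM store_sales")
# Here a Python interpreter stands in for an engine CLI:
elapsed = run_query([sys.executable, "-c"], "pass")
```

Because the clock starts before the process is launched, engine startup and job-submission latency count against each query, which matters for short queries.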
6. Dive into Details
99 queries in total
                     Hive LLAP   Presto                     Comparison (H vs P)
Faster cases (num)   53          17                         3.1 times
Total runtime (s)    1351        2058                       65.6% (1.5 times faster)
Failed cases (num)   0           21 (OOM or syntax error)
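The comparison column can be reproduced directly from the raw numbers; this is a quick arithmetic check, not part of the original benchmark code:

```python
hive_llap_total, presto_total = 1351, 2058  # total runtime in seconds
hive_faster, presto_faster = 53, 17         # queries each engine won

# Hive LLAP won about 3.1x as many queries as Presto
win_ratio = hive_faster / presto_faster

# Hive LLAP's total runtime is about 65.6% of Presto's ...
runtime_fraction = hive_llap_total / presto_total

# ... i.e. Presto took about 1.5x as long overall
speedup = presto_total / hive_llap_total

print(round(win_ratio, 1), round(100 * runtime_fraction, 1), round(speedup, 1))
```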
Due to time constraints, the benchmark used the same set of SQL queries for Hive 2
8. Compare with Current EMR(5) Installation
Conditions:
➔ Input data: same (24 tables, 10 GB text, converted to ORC format)
➔ TPC-DS queries on ORC tables
➔ EMR 5 instance specs are much better (worker nodes: 16 vCPU, 32GB RAM × 10)
Results: Hive LLAP is tens of times faster than EMR 5 (details on the next slide)
Reasons:
1. LLAP setting
2. ORC file format issue
3. Other Hive configurations
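Of the reasons above, the LLAP setting is the biggest lever. A minimal sketch of the session-level Hive 2.x properties that distinguish an LLAP setup from a stock EMR Hive install follows; the property names are from the Hive 2.x configuration reference, but the values are illustrative, not the benchmark's exact config:

```python
# Illustrative (not exhaustive) Hive 2.x settings for routing work to LLAP;
# values shown here are typical defaults for an LLAP setup, not the
# benchmark's actual configuration.
llap_settings = {
    "hive.execution.engine": "tez",             # LLAP runs on top of Tez
    "hive.execution.mode": "llap",              # route fragments to LLAP daemons
    "hive.llap.execution.mode": "all",          # run all eligible work in LLAP
    "hive.llap.io.enabled": "true",             # columnar cache / IO layer
    "hive.vectorized.execution.enabled": "true",
}

# Rendered as command-line flags for the hive/beeline CLI:
flags = " ".join(f"--hiveconf {k}={v}" for k, v in llap_settings.items())
print(flags)
```

In a non-LLAP install, `hive.execution.mode` stays at `container` and queries pay per-query container spin-up costs, which is one source of the gap seen here.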
9. TPC-DS Queries Duration Time
Query No.   EMR 5 (s)   Hive LLAP (s)   Difference (times)
No. 1       130.717     1.855           70
No. 11      184.94      7.434           25
No. 12      31.565      1.384           23
No. 50      90.435      11.576          8
No. 66      140.1       3.742           38
Query performance in this table includes only “query-duration” time.
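The Difference column is simply the ratio of the two durations; recomputing it from the table reproduces the reported figures to within rounding (a quick reproduction, not original benchmark code):

```python
# (EMR 5 seconds, Hive LLAP seconds) per query, copied from the table above
timings = {
    "No. 1":  (130.717, 1.855),
    "No. 11": (184.94,  7.434),
    "No. 12": (31.565,  1.384),
    "No. 50": (90.435, 11.576),
    "No. 66": (140.1,   3.742),
}

# Speedup of Hive LLAP over EMR 5 = EMR duration / LLAP duration
speedups = {q: emr / llap for q, (emr, llap) in timings.items()}
for q, s in speedups.items():
    print(f"{q}: {s:.1f}x faster")
```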
10. Many other details are omitted
The following configurations are omitted for space; they are, in general, clearly slower:
❏ Hive with LLAP on MR engine
❏ Hive with Tez without LLAP
11. Future work
❏ Hive LLAP with Druid as the storage engine
❏ Compare with EMR 5.0 installation (with some constraints on ORC support)
❏ Parquet vs ORC
❏ ORC with different compaction strategies (BI/ETL/Hybrid)
❏ Evaluation on creations of bloom filter and zone map