The document provides an agenda and summaries of presentations for an Apache Phoenix conference. The presentations will cover Phoenix use cases at various companies, new features in Phoenix like ACID transactions using Tephra and cost-based query optimization with Calcite, and interoperability between Phoenix and Drill. One presentation will discuss using Phoenix for time-series data at Salesforce, another will provide tips for optimizing performance on large Phoenix clusters at Sony, and a third will cover how Phoenix is used at eHarmony for batch processing and low-latency queries.
Apache Phoenix: Use Cases and New Features (HBaseCon)
James Taylor (Salesforce) and Maryann Xue (Intel)
This talk will be broken into two parts: Phoenix use cases and new Phoenix features. Three use cases will be presented as lightning talks by individuals from 1) Sony about its social media NewsSuite app, 2) eHarmony on its matching service, and 3) Salesforce.com on its time-series metrics engine. Two new features will be discussed in detail by the engineers who developed them: ACID transactions in Phoenix through Apache Tephra, and cost-based query optimization through Apache Calcite. The focus will be on helping end users more easily develop scalable applications on top of Phoenix.
Column qualifier encoding in Apache Phoenix provides benefits over using column names as column qualifiers. It assigns numbers as column qualifiers instead of names, allowing for more efficient column renaming and optimizations. Test results showed the encoded approach used less disk space, improved ORDER BY and GROUPED AGGREGATION performance by up to 2x, and had near constant growth in ORDER BY performance as columns increased versus non-encoded approaches. Further work is ongoing to fully implement and test column encoding.
This document provides an agenda and summaries of presentations for an Apache Phoenix conference. The agenda includes presentations on using Phoenix for time-series data at Salesforce, optimization techniques for a large Phoenix/HBase cluster at Sony, and how Phoenix was used at eHarmony. New features to be discussed are ACID transactions powered by Tephra and cost-based query optimization using Calcite. The document also provides a brief summary of each presentation topic.
Apache Phoenix Query Server, PhoenixCon 2016 (Josh Elser)
This document discusses Apache Phoenix Query Server, which provides a client-server abstraction for Apache Phoenix using Apache Calcite's Avatica sub-project. It allows Phoenix to have thin clients by offloading computational resources to query servers running on Hadoop clusters. This enables non-Java clients through a standardized HTTP API. The query server implementation uses HTTP, Protocol Buffers for serialization, and common libraries like Jetty and Dropwizard Metrics. It aims to simplify Phoenix client development and improve performance and scalability.
This document summarizes a presentation about Apache Phoenix, an open-source project that allows HBase to be queried with SQL. It discusses what Phoenix is, why tracing is important, and the features of a new tracing web app created for Phoenix, including listing traces, visualizing trace distributions and individual trace details. Programming challenges in creating the app and new issues filed are also summarized.
Apache Phoenix is a SQL query layer for Apache HBase that allows users to interact with HBase through JDBC. It transforms SQL queries into native HBase API calls to optimize execution across the HBase cluster in a parallel manner. The presentation covered Phoenix's current features like join support, new features like functional indexes and user defined functions, and the future integration with Apache Calcite to bring more SQL capabilities and a cost-based query optimizer to Phoenix. Overall, Phoenix provides a relational view of data stored in HBase to enable complex SQL queries to run efficiently on large datasets.
This document summarizes a presentation about Apache Phoenix and HBase. It discusses the past, present, and future of SQL on HBase. In the past section, it describes Phoenix's architecture and key features like secondary indexes, joins, and aggregation. The present section highlights recent Phoenix releases including row timestamps, transactions using Tephra, and the new Phoenix Query Server. The future section mentions upcoming integrations with Calcite and Hive.
The Evolution of a Relational Database Layer over HBase (DataWorks Summit)
Apache Phoenix is a SQL query layer over Apache HBase that allows users to interact with HBase through JDBC and SQL. It transforms SQL queries into native HBase API calls for efficient parallel execution on the cluster. Phoenix provides metadata storage, SQL support, and a JDBC driver. It is now a top-level Apache project after originally being developed at Salesforce. The speaker discussed Phoenix's capabilities like joins and subqueries, new features like HBase 1.0 support and functional indexes, and future plans like improved optimization through Calcite and transaction support.
Apache Phoenix: Past, Present and Future of SQL over HBase (enissoz)
HBase, as the NoSQL database of choice in the Hadoop ecosystem, has already proven itself at scale and in many mission-critical workloads in hundreds of companies. Phoenix, as the SQL layer on top of HBase, is increasingly becoming the tool of choice as the perfect complement to HBase. Phoenix is now being used more and more for very low latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive among current and prospective HBase users, like SQL support, JDBC, data modeling, secondary indexing, UDFs, and also go over recent improvements like the Query Server, ODBC drivers, ACID transactions, Spark integration, etc. We will conclude by looking at items in the pipeline and how Phoenix and HBase interact with other engines like Hive and Spark.
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse (Josh Elser)
An overview of Apache Phoenix and Apache HBase from the angle of a traditional data warehousing solution. This talk focuses on where this open-source architecture fits into the market, outlines the features and integrations of the product, and shows that it is a viable alternative to traditional data warehousing solutions.
Apache Phoenix: Transforming HBase into a SQL Database (DataWorks Summit)
The document discusses Apache Phoenix, which transforms HBase into a SQL database by providing a query engine, metadata repository, and embedded JDBC driver to access HBase data. It is the fastest way to access HBase data thanks to techniques like push-down query optimization and client-side parallelization. Phoenix also helps HBase scale by allowing multiple tables to share the same physical HBase table through updateable views and multi-tenant tables and views.
Apache Phoenix’s relational database view over Apache HBase delivers a powerful tool which enables users and developers to quickly and efficiently access their data using SQL. However, Phoenix only provides a Java client, in the form of a JDBC driver, which limits Phoenix access to JVM-based applications. The Phoenix QueryServer is a standalone service which provides the building blocks to use Phoenix from any language, not just those running in a JVM. This talk will serve as a general purpose introduction to the Phoenix QueryServer and how it complements existing Apache Phoenix applications. Topics covered will range from design and architecture of the technology to deployment strategies of the QueryServer in production environments. We will also include explorations of the new use cases enabled by this technology like integrations with non-JVM based languages (Ruby, Python or .NET) and the high-level abstractions made possible by these basic language integrations.
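To make the thin-client model concrete, here is a minimal Java sketch of connecting through the Query Server; the host, port, and table name are assumptions, while the jdbc:phoenix:thin URL style and protobuf serialization follow the mechanism described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ThinClientSketch {
    public static void main(String[] args) throws Exception {
        // The thin driver speaks HTTP + Protocol Buffers to a Query Server;
        // queryserver.example.com:8765 is a placeholder deployment.
        String url = "jdbc:phoenix:thin:url=http://queryserver.example.com:8765;"
                   + "serialization=PROTOBUF";
        try (Connection conn = DriverManager.getConnection(url);
             ResultSet rs = conn.createStatement()
                                .executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}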
The document discusses Apache HBase replication, which asynchronously copies data between HBase clusters. It uses a push-based architecture shipping write-ahead log (WAL) entries similarly to MySQL replication. Replication provides eventual consistency and preserves the atomicity of individual updates. Administrators can configure replication by setting parameters and managing peer clusters and queues stored in Zookeeper. Replicated edits flow from the replication source on a region server to the remote replication sink where they are applied.
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera (Cloudera, Inc.)
This talk will run through the list of filters that are shipped with HBase and show how they are used from a client application. Filters expose varying feature sets, but also exhibit an equally varying impact on read performance – but neither are directly intuitive. A skilled HBase practitioner should know how to select the proper filter for a given use-case, or how to combine sets of filters to achieve what is needed. The talk will conclude with an example for a custom filter and explain how to deploy it on a cluster.
Apache Phoenix is a SQL skin over HBase that allows for low latency SQL queries over HBase data. It transforms SQL queries into native HBase APIs like scans and puts. Phoenix supports features like secondary indexing, multi-tenancy, and limited hash joins. It aims to leverage existing SQL tooling while providing performance optimizations like parallel scans. Upcoming features include improved secondary indexing and transaction support. Phoenix maps to existing HBase tables and allows dynamic columns to extend schemas during queries.
Apache HBase Internals you hoped you Never Needed to Understand (Josh Elser)
Covers numerous internal features, concepts, and implementations of Apache HBase. The focus will be driven from an operational standpoint, investigating each component enough to understand its role in Apache HBase and the generic problems that each are trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high-level, attempting to distill the often complicated details down to the most salient information.
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been a long time in development, some of which include rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, and shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical detail on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next big and most exciting milestone release because of Phoenix's integration with Apache Calcite, which adds a lot of performance benefits through the new query optimizer and helps Phoenix integrate with other data sources, especially those also based on Calcite. It has a lot of cool features such as encoded columns, Kafka and Hive integration, improvements in secondary index rebuilding, and many performance improvements.
Apache Phoenix with Actor Model (Akka.io) for real-time Big Data Programming... (Trieu Nguyen)
Apache Phoenix with Actor Model (Akka.io) for real-time Big Data Programming Stack
Why do we still need SQL for Big Data?
How can we make Big Data more responsive and faster?
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPLSQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
The document summarizes Apache Phoenix and HBase as an enterprise data warehouse solution. It discusses how Phoenix provides OLTP and analytics capabilities over HBase. It then covers various use cases where companies are using Phoenix and HBase, including for web analytics and time series data. Finally, it discusses optimizations that can be made to the schema design, queries, and writes in Phoenix to improve performance.
HBase Read High Availability Using Timeline Consistent Region Replicas (enissoz)
This document summarizes a talk on implementing timeline consistency for HBase region replicas. It introduces the concept of region replicas, where each region has multiple copies hosted on different servers. The primary accepts writes, while secondary replicas are read-only. Reads from secondaries return possibly stale data. The talk outlines the implementation of region replicas in HBase, including updates to the master, region servers, and IPC. It discusses data replication approaches and next steps to implement write replication using the write-ahead log. The goal is to provide high availability for reads in HBase while tolerating single-server failures.
This document provides an overview of Apache Phoenix, including:
- A brief history of how it originated as an internal project at Salesforce before becoming a top-level Apache project.
- An architectural overview explaining that Phoenix provides a SQL interface for Apache HBase and runs on top of HDFS to enable next-generation data applications on HBase.
- Descriptions of Phoenix's key capabilities like SQL support, transactions, user-defined functions, and secondary indexes to boost query performance.
- Examples of how Phoenix can be used for common scenarios like analyzing server metrics data.
Cloudera Impala: A Modern SQL Engine for Hadoop (Cloudera, Inc.)
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
This document provides an overview of Apache Hadoop and HBase. It begins with an introduction to why big data is important and how Hadoop addresses storing and processing large amounts of data across commodity servers. The core components of Hadoop, HDFS for storage and MapReduce for distributed processing, are described. An example MapReduce job is outlined. The document then introduces the Hadoop ecosystem, including Apache HBase for random read/write access to data stored in Hadoop. Real-world use cases of Hadoop at companies like Yahoo, Facebook and Twitter are briefly mentioned before addressing questions.
Apache HBase in the Enterprise Data Hub at Cerner (HBaseCon)
Swarnim Kulkarni (Cerner)
Cerner has been an active consumer of HBase for a very long time, storing petabytes of healthcare data in its multiple isolated HBase clusters. This talk will walk through the design of Cerner's enterprise data hub with a focus on the multi-tenant HBase as a service offering within the hub.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba (Michael Stack)
Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
This document discusses setting up a pilot Kafka service at CERN. It outlines requirements from five major use cases, including throughput, retention policies, security, infrastructure needs, and administration capabilities. The current development of an on-demand Kafka service approach is described, along with configuration and management APIs, security configuration, and monitoring capabilities. Next steps include evaluating the pilot service, consolidating it to production, and further developing the self-service platform.
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series (Amazon Web Services)
Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon's highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo to show you how to deploy a Hadoop environment within minutes. We will cover common use cases and best practices to lower costs using Amazon S3 as your data store and Amazon EC2 Spot Instances, which allow you to bid on spare Amazon EC2 computing capacity.
Learning Objectives:
• Describe the new features and updated frameworks in Amazon EMR 5.0
• Learn best practices and real-world applications for Amazon EMR
• Understand how to use EC2 Spot pricing to save costs
• Explain the advantages of decoupling storage and compute with Amazon S3 as storage layer for EMR workloads
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
1) HAWQ is an SQL and machine learning engine that runs on Hadoop, providing SQL capabilities and machine learning functionality directly on HDFS data.
2) HAWQ provides up to 30x faster performance than other SQL-on-Hadoop engines like Impala and Hive, through its massively parallel processing (MPP) architecture and query optimization capabilities.
3) Key features of HAWQ include ANSI SQL compliance, integrated machine learning via the MADlib library, flexible deployment across on-premises and cloud environments, and high scalability to petabytes of data.
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark (Michael Stack)
This document discusses using Phoenix and Spark with ApsaraDB HBase. It covers the architecture of Phoenix as a service over HBase, use cases like log and internet company scenarios, best practices for table properties and queries, challenges around availability and stability, and improvements being made. It also discusses how Spark can be used for analysis, bulk loading, real-time ETL, and to provide elastic compute resources. Example architectures show Spark SQL analyzing HBase and structured streaming incrementally loading data. Scenarios discussed include online reporting, complex analysis, log indexing and querying, and time series monitoring.
AI&BigData Lab 2016. Viktor Sarapin: Size Matters: On-Demand Analysis... (GeeksLab Odessa)
4.6.16 AI&BigData Lab
Upcoming events: goo.gl/I2gJ4H
How to set up data analysis for 40 million people over 5 years so that it looks almost real-time.
This document provides an agenda and overview for a presentation on SQL on Hadoop. The presentation will cover various SQL on Hadoop technologies including Hive, HAWQ, Impala, SparkSQL, HBase with Phoenix, and Drill. It will also include an introduction, surveys to collect information from attendees, and discussions on networking and food. The hosts will provide background on their experience with big data and Hadoop.
Architectural Evolution Starting from Hadoop (SpagoWorld)
Speech given by Monica Franceschini, Solution Architecture Manager at the Big Data Competency Center of Engineering Group, on the occasion of the Data Driven Innovation Rome 2016 - Open Summit.
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Big Data Streams Architectures. Why? What? How? (Anton Nazaruk)
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that conforms to low-latency BigData analysis requirements. Apache Kafka, and Kappa Architecture in particular, are attracting more and more attention relative to the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams are proving to be a synergy in the BigData world.
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014 (Chris Fregly)
Spark Streaming allows for processing of real-time data streams using Spark. The document discusses using Spark Streaming with Amazon Kinesis for streaming data ingestion. It covers the Spark Streaming and Kinesis integration architecture, how the Spark Kinesis receiver works, scaling considerations, and fault tolerance mechanisms through checkpointing. Examples of monitoring and tuning Spark Streaming jobs on Kinesis data are also provided.
This talk will give an overview of two exciting releases for Apache HBase 2.0 and Phoenix 5.0. HBase provides a NoSQL column store on Hadoop for random, real-time read/write workloads. Phoenix provides SQL on top of HBase. HBase 2.0 contains a large number of features that were a long time in development, including rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc), async clients and WAL, a C++ client, offheaping memstore and other buffers, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into details on some of the most important improvements in the release, as well as what are the implications for the users in terms of API and upgrade paths. Phoenix 5.0 is the next big Phoenix release because of its integration with HBase 2.0 and a lot of performance improvements in support of secondary Indexes. It has many important new features such as encoded columns, Kafka and Hive integration, and many other performance improvements. This session will also describe the uses cases that HBase and Phoenix are a good architectural fit for.
Streaming Solutions for Real time problems (Abhishek Gupta)
The document is a presentation on streaming solutions for real-time problems using Apache Kafka, Kafka Streams, and Redis. It begins with an introduction and overview of the technologies. It then presents a sample monitoring application using metrics from multiple machines as a use case. The presentation demonstrates how to implement this application using Kafka as the event store, Kafka Streams for processing, and Redis as the state store. It also shows how to deploy the application components on Oracle Cloud.
Apache conbigdata2015 christiantzolov - federated sql on hadoop and beyond - lev... (Christian Tzolov)
Slides from ApacheCon BigData 2015 HAWQ/GEODE talk: http://sched.co/3zut
In the space of Big Data, two powerful data processing tools complement each other: HAWQ and Geode. HAWQ is a scalable OLAP SQL-on-Hadoop system, while Geode is an OLTP-like, in-memory data grid and event processing system. This presentation shows different approaches that allow integration and data exchange between HAWQ and Geode. It walks through the implementation of the different integration strategies, demonstrating the power of combining various OSS technologies for processing big and fast data, and touches upon OSS technologies like HAWQ, Geode, SpringXD, Hadoop and Spring Boot.
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Real time fraud detection at 1+M scale on hadoop stack
HBaseCon2016-final
1. Use Cases and New Features
@ApachePhoenix
http://phoenix.apache.org
V5
2. Agenda
• Phoenix Use Cases
– Argus: Time-series data with Phoenix (Tom Valine, Salesforce.com)
– Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster (Masayasu Suzuki, Sony)
– Phoenix & eHarmony, a perfect match (Vijay Vangapandu, eHarmony)
• What’s new in Phoenix
– ACID Transactions with Tephra (Poorna Chandra, Cask)
– Cost-based Query Optimization with Calcite (Maryann Xue, Intel)
• Q & A
–PhoenixCon tomorrow 9am-1pm @ Salesforce.com, 1 Market St, SF
4. OpenTSDB Limitations
OpenTSDB is good, but we need more
•Tag Cardinality
– Total number of tags per metric is limited to 8
– Performance decreases drastically as tag values increase.
•UID Exhaustion
– Hard limit of 16M UIDs
•Ad hoc querying not possible
– Join to other data sources
– Joins of time series and events
– Simplification of Argus’ transform grammar
5. Phoenix-backed Argus TSDB Service
• 3 day hackathon
• Modeled metric as Phoenix VIEW
– Leverage ROW_TIMESTAMP optimization
• Tag values inlined in row key
– Uses SKIP_SCAN filter optimization
– Allows for secondary indexes on particular metric + tags
• Metric and tag names managed outside of data as metadata
• Eventually leverage Drillix (Phoenix + Drill)
– Cross cluster queries
– Joins to other data sources
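To make the modeling concrete, the following is a hedged Java/JDBC sketch of such a schema; every table, column, and view name is invented for illustration, while ROW_TIMESTAMP on a date-typed primary-key column and views over a shared table are standard Phoenix features.

import java.sql.Connection;
import java.sql.DriverManager;

public class ArgusSchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // Base table: tag values are inlined in the row key; declaring TS
            // as ROW_TIMESTAMP maps it onto the native HBase cell timestamp.
            // Key column order here is illustrative only.
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS METRICS ("
              + " METRIC_ID BIGINT NOT NULL,"
              + " TAGS VARCHAR NOT NULL,"
              + " TS DATE NOT NULL,"
              + " VAL DOUBLE"
              + " CONSTRAINT PK PRIMARY KEY (METRIC_ID, TAGS, TS ROW_TIMESTAMP))");
            // One view per metric; metric and tag names map to ids in
            // metadata managed outside the data, as described above.
            conn.createStatement().execute(
                "CREATE VIEW IF NOT EXISTS CPU_USER AS"
              + " SELECT * FROM METRICS WHERE METRIC_ID = 1");
        }
    }
}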
6. Write Performance
Using 2 clients to write in parallel. Phoenix is using 10 writer threads per client
7. Read Performance
• Metrics with one tag (60 distinct values)
– OpenTSDB and Phoenix performance comparable for small aggregations
– Phoenix outperforms OpenTSDB as aggregation size increases
8. Disk usage
• Phoenix & OpenTSDB use approximately the same amount of space with FAST_DIFF and Snappy compression
9. Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
Masayasu “Mas” Suzuki
Shinji Nagasaka
Takanari Tamesue
Sony Corporation
10. Who we are, and why we chose HBase/Phoenix
• We are DevOps members from Sony's News Suite team: http://socialife.sony.net/
• HBase/Phoenix was chosen because of a. scalability, b. SQL compatibility, and c. secondary indexing support
12. Performance test apparatus & results
• Test apparatus
– Number of records: 1.2 billion records (1 KB each)
– Number of indexes: 8 orthogonal indexes
– Servers: 3 ZooKeepers (ZooKeeper 3.4.5, m3.xlarge x 3); 3 HMaster servers (Hadoop 2.5.0, HBase 0.98.6, Phoenix 4.3.0, m3.xlarge x 3); 200 RegionServers (Hadoop 2.5.0, HBase 0.98.6, Phoenix 4.3.0, r3.xlarge x 199, c4.8xlarge x 1)
– Clients: 100 x c4.xlarge
• Test results
– Throughput: 51,053 queries/sec
– Response time (average): 46 ms
13. Five major tips to maximize performance using HBase/Phoenix
Ordered by effectiveness (most effective at the very top):
1. Use a SQL hint clause when using a secondary index (see the sketch after this list)
– An extra RPC is issued when the client runs a SQL statement that uses a secondary index
– Using a SQL hint clause can mitigate this
– From Ver. 4.7, changing “UPDATE_CACHE_FREQUENCY” may also work (we have yet to test this)
2. Use memories aggressively
– A memory-rich node should be selected for use in RegionServers so as to minimize disk access
3. Manually split region files if possible, but never over-split them
4. Scale out instead of scaling up
5. Avoid running power-intensive tasks simultaneously
– As an example, running major compaction and index creation simultaneously should be avoided
Details will be presented at PhoenixCon tomorrow (May 25)
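For tip 1, the hint names the target index explicitly so the client avoids the extra round trip described above; a minimal sketch with invented table and index names (the /*+ INDEX(...) */ hint itself is standard Phoenix syntax):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class HintedQuerySketch {
    public static void main(String[] args) throws Exception {
        // Assumes a secondary index such as:
        //   CREATE INDEX NEWS_BY_DATE ON NEWS_ITEMS (PUBLISHED_DATE)
        String sql = "SELECT /*+ INDEX(NEWS_ITEMS NEWS_BY_DATE) */ ID, TITLE"
                   + " FROM NEWS_ITEMS WHERE PUBLISHED_DATE > CURRENT_DATE() - 7";
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             ResultSet rs = conn.createStatement().executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("TITLE"));
            }
        }
    }
}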
15. eHarmony and Phoenix a perfect match
NEED FOR
● Handling 30+ million events during batch runs
● Serving low-latency queries on 16+ billion records
● Response times: 75th percentile 800 ms, 95th percentile 2 s, 99th percentile 4 s
16. eHarmony and Phoenix a perfect match
LAMBDA FOR THE SAVE
• Layered architecture provides fault tolerance
• HBase as batch storage for write throughput with reasonable read latency
• Apache Phoenix as query layer to work with complex queries with confidence
• Redis as speed layer cache
17. eHarmony and Phoenix a perfect match
PERFORMANCE
(Charts: Get Matches API and Save Match API response times, each annotated at the point where Phoenix/HBase went live.)
18. eHarmony and Phoenix a perfect match
WHY HBASE AND PHOENIX
HBASE
• Highly consistent and fault tolerant
• Need for store-level filtering and sorting
APACHE PHOENIX
• Apache Phoenix helped us build an abstract high-performance query layer on top of HBase
• Eased the development process
• Reduced boilerplate code, which improves maintainability
• Build complex queries with confidence
• Secondary indexes
• JDBC connection
• Good community support
19. eHarmony and Phoenix a perfect match
JAVA ORM LIBRARY(PHO)
• Apache Phoenix helped us build PHO (Phoenix-HBase ORM)
• PHO provides the ability to annotate your entity bean and provides interfaces to build DSL-like queries.
// Build an OR group over the allowed status values
Disjunction disjunction = new Disjunction();
for (int statusFilter : statusFilters) {
    disjunction.add(Restrictions.eq("status", statusFilter));
}

// Compose a typed query against the annotated FeedItemDto entity
QueryBuilder.builderFor(FeedItemDto.class).select()
    .add(Restrictions.eq("userId", userId))
    .add(Restrictions.gte("spotlightEnd", spotlightEndDate))
    .add(disjunction)
    .setReturnFields(projection)
    .addOrder(orderings)
    .setMaxResults(maxResults)
    .build();
20. eHarmony and Phoenix a perfect match
http://eharmony.github.io/
OPEN SOURCE REPOSITORY
https://github.com/eHarmony/pho
http://www.eharmony.com/about/careers/
*Please Join us for more details at PhoenixCon tomorrow (May 25)
22. Why Transactions?
• All-or-none semantics simplifies the life of the developer
– Ensures every client has a consistent view of data
– Protects against concurrent updates
– No need to reason about what state data is left in if a write fails
– Guaranteed consistency between data and index
23. Apache Tephra
• Transactions on HBase
– Across regions, tables and RPC calls
• ACID semantics
• Tephra Powers
– CDAP (Cask Data Application Platform)
– Apache Phoenix (4.7 onwards)
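A hedged sketch of how this surfaces to a Phoenix user (table and column names invented, and the cluster is assumed to have transactions enabled): a table opts in with the TRANSACTIONAL=true property, and a multi-statement transaction is bounded by the ordinary JDBC commit.

import java.sql.Connection;
import java.sql.DriverManager;

public class TransactionSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS ACCOUNTS ("
              + " ID BIGINT PRIMARY KEY, BALANCE BIGINT) TRANSACTIONAL=true");
            conn.setAutoCommit(false);
            // Both UPSERTs, and any index maintenance they trigger,
            // become visible atomically at commit time.
            conn.createStatement().executeUpdate("UPSERT INTO ACCOUNTS VALUES (1, 50)");
            conn.createStatement().executeUpdate("UPSERT INTO ACCOUNTS VALUES (2, 150)");
            conn.commit(); // all or none
        }
    }
}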
25. Tephra Components
• TransactionAware client
– Coordinates transaction lifecycle with the manager
– Communicates directly with HBase for reads and writes
• Transaction Manager
– Assigns transaction IDs
– Maintains state on in-progress, committed and invalid transactions
• Transaction Processor coprocessor
– Applies server-side filtering for reads
– Cleans up data from failed transactions, and no-longer-visible versions
26. Snapshot Isolation
• Multi-version concurrency control
– Cell version (timestamp) = transaction ID
– Reads exclude other uncommitted transactions (for isolation)
• Optimistic concurrency control
– Avoids the cost of locking rows and tables
– Good if conflicts are rare: short transactions, disjoint partitioning of work
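Under this optimistic scheme nothing blocks at write time; a conflicting writer only finds out at commit. A hedged two-connection sketch, reusing the invented ACCOUNTS table from the earlier transaction example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConflictSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:phoenix:localhost";
        try (Connection a = DriverManager.getConnection(url);
             Connection b = DriverManager.getConnection(url)) {
            a.setAutoCommit(false);
            b.setAutoCommit(false);
            a.createStatement().executeUpdate("UPSERT INTO ACCOUNTS VALUES (1, 60)");
            b.createStatement().executeUpdate("UPSERT INTO ACCOUNTS VALUES (1, 70)");
            a.commit();          // first committer wins
            try {
                b.commit();      // write-write conflict on row 1 detected here
            } catch (SQLException conflict) {
                // b's transaction is invalidated; the application can retry
            }
        }
    }
}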
27. Performance
• Single client using 10 threads in parallel with 5K batch size
• No performance penalty for non-transactional tables
28. Future Work
• Partitioned Transaction Manager
• Automatic pruning of invalid transaction list
• Read-only transactions
• Performance optimizations
• Conflict detection
• Appends to transaction edit log
30. Integration model
A statement travels down the following stack:
• JDBC Client (SQL + Phoenix-specific grammar)
• Calcite Parser & Validator
• Calcite Query Optimizer (built-in rules + Phoenix-specific rules)
• Phoenix Query Plan Generator
• Phoenix Runtime
• Phoenix Tables over HBase
31. Cost-based query optimizer
with Apache Calcite
• Base all query optimization decisions on cost
– Filter push down; range scan vs. skip scan
– Hash aggregate vs. stream aggregate vs. partial stream aggregate
– Sort optimized out; sort/limit push through; fwd/rev/unordered scan
– Hash join vs. merge join; join ordering
– Use of data table vs. index table
– All of the above (and many others) COMBINED
• Query optimizations are modeled as pluggable rules
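These pluggable rules are Calcite planner rules; the skeleton below is not actual Phoenix source (the class name and match logic are invented) but sketches the shape such a rule takes against Calcite's rule API of that era.

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.logical.LogicalSort;

// Matches a LogicalSort anywhere in the plan and offers the planner a
// sort-free alternative; the planner keeps whichever plan costs less.
// A real rule would first verify that the input's collation already
// satisfies the sort before proposing the transformation.
public class RedundantSortRemovalRule extends RelOptRule {
    public RedundantSortRemovalRule() {
        super(operand(LogicalSort.class, any()), "RedundantSortRemovalRule");
    }

    @Override
    public void onMatch(RelOptRuleCall call) {
        LogicalSort sort = call.rel(0);
        call.transformTo(sort.getInput());
    }
}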
32. Beyond Phoenix 4.8 with Apache Calcite
• Get the missing SQL support
– WITH, UNNEST, Scalar subquery, etc.
• Materialized views
– To allow other forms of indices (maybe defined as external), e.g., a filter view, a join view, or an aggregate view
• Interop with other Calcite adaptors
– Already used by Drill, Hive, Kylin, Samza, etc.
– Supports any JDBC source
– Initial version of Drill-Phoenix integration already working
33. Query Example - no cost-based optimizer

select empid, e.name, d.deptno, d.name, location
from emps e join depts d using (deptno)
order by e.deptno

The Phoenix compiler produces one fixed plan:
1. scan 'depts'; send 'depts' over to the RegionServers and build a hash-cache
2. scan 'emps', hash-joining 'depts'
3. sort the joined table on e.deptno
34. Query Example - with cost-based optimizer
(sort optimization combined with join algorithm decision)

select empid, e.name, d.deptno, d.name, location
from emps e join depts d using (deptno)
order by e.deptno

Logical plan:
LogicalSort (key: deptno)
  LogicalProject (empid, e.name, d.deptno, d.name, location)
    LogicalJoin (inner, e.deptno = d.deptno)
      LogicalTableScan (emps)
      LogicalTableScan (depts)

The optimizer applies its optimization rules plus the Phoenix operator conversion rules, producing:

PhoenixClientProject (empid, e.name, d.deptno, d.name, location)
  PhoenixMergeJoin (inner, e.deptno = d.deptno)
    PhoenixServerSort (key: deptno)
      PhoenixServerProject (empid, name, deptno)
        PhoenixTableScan (emps)
    PhoenixServerProject (deptno, name, location)
      PhoenixTableScan (depts)
36. Query Example - Comparison

Aspect | w/o cost-based optimizer | w/ cost-based optimizer
scan 'emps', 'depts' | first 'depts', then 'emps' | 2 tables in parallel
hash-cache send & build | proportional to size of 'depts'; might cause exception if too large | none
hash-cache look-up | 1 look-up per 'emps' row | none
sorting | sort the 'emps'-'depts' join result | sort 'emps' only
optimization approach | local, serial optimization processes | cost-based, rule-driven, integrated
performance (single node, 2M * 2K rows) | 19.46 s | 13.92 s
37. Drillix: Interoperability with Drill

select deptno, sum(salary) from emps group by deptno

Stage 1 - local partial aggregation: Phoenix Partial Aggregation (deptno, sum(salary)) over a Phoenix Table Scan of emps (Phoenix tables over HBase)
Stage 2 - shuffle partial results: Drill Shuffle
Stage 3 - final aggregation: Drill Final Aggregation (deptno, sum(salary))
38. Thank you! Questions?
Join us tomorrow for PhoenixCon
Salesforce.com, 1 Market St, SF 9am-1pm
(some companies using Phoenix)