The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
How to use Hadoop for operational and transactional purposes by RODRIGO MERI... (Big Data Spain)
Hadoop is an open source framework designed to rapidly ingest, store, and analyze large data sets. Hadoop is well suited for batch processing where immediate interactive analytics are not required. But today, Hadoop does not support operational and transactional workloads, which consist of a constant flow of transactions requiring low-latency response times for read/write access.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
A TPC Benchmark of Hive LLAP and Comparison with Presto (Yu Liu)
A TPC-H/DS benchmark of Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and durability.
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2018, and the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been in development for a long time, including rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heap memstore and other buffers, shading of dependencies, and many other fixes and stability improvements. We will go into technical detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Phoenix 5.0 is the next big release, notable for its integration with HBase 2.0 and many performance improvements in support of secondary indexes. It has features such as encoded columns, Kafka and Hive integration, and many other performance improvements. Speakers: Ankit Singhal, Senior Software Engineer, Hortonworks Inc. and Rajeshbabu Chintaguntla, Staff Software Engineer, Hortonworks
We will talk about two real-world challenging SQL-on-Hadoop use cases: #1 Highly Parallel Workload Over Massive Data, #2 Sub-second SQL for Online Reporting. The challenge is to meet very strict performance requirements over hundreds of billions of rows. We will introduce how we solved these challenges using Hive on Tez, Hive LLAP and Phoenix, with real-life performance numbers!
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.
Keynote from Apache Big Data EU. This introduces training we are doing at Hortonworks to help our employees understand and work well as part of the Apache Software Foundation.
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive (ODSC)
The main objective of this workshop is to give the audience hands-on experience with several Hadoop technologies and jump-start their Hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping into the technology, the founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when.
Keynote slides from Big Data Spain, Nov 2016. Offers some thoughts on how the Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
Data in Motion, Data at Rest: Hortonworks' Modern Architecture (Mats Johansson)
Presentation at Data Innovation Summit 2016 in Stockholm
How to build a modern data architecture supporting data in motion and data at rest with Hortonworks Data Flow and Data Platform.
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra (Joe Stein)
Slides for the solution we developed using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all developed in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening across your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build and run these solutions as a service.
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns in the cloud while avoiding most of the operational overhead of maintaining a caching layer: the cache is automatically coherent, uses intelligent eviction, and supports file formats from text to ORC. We also explore combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We give an overview of the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration in the cloud, with new, improved concurrent query support and production-ready tools and UI.
Speaker: Sergey Shelukhin, Member of Technical Staff, Hortonworks
We discuss the current state of LLAP (Live Long and Process), the engine in Hive 2.0 for concurrent, sub-second execution of analytical queries. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on top of time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries and enabling the system to preempt tasks of lower priority without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution: the elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Stinger.Next by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver Speed, Scale and better SQL.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes computation to the server and why this yields performance that enables direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Real-time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Hive 3 New Horizons, DataWorks Summit Melbourne February 2019 (alanfgates)
Hive 3 new SQL features, including LLAP, workload management, SQL over Kafka and JDBC data sources, integration with Spark via the Hive Warehouse Connector, ACID 2, and constraints and default values.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Speaker: Alan Gates, Co-Founder, Hortonworks
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
1. Hive & HBase For Transaction Processing
Alan Gates
@alanfgates
2. Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
4. A Brief History of Hive
• Initial goal was to make it easy to execute MapReduce using a familiar language: SQL
– Most queries took minutes or hours
– Primarily used for batch ETL jobs
• Since 0.11 much has been done to support interactive and ad hoc queries
– Many new features focused on improving performance: ORC and Parquet, Tez and Spark, vectorization
– As of Hive 0.14 (November 2014), TPC-DS query 3 (star-join, group, order, limit) using ORC, Tez, and vectorization finishes in 9s at 200GB scale and 32s at 30TB scale
– Still have ~2-5 second minimum for all queries
• Ongoing performance work with the goal of reaching sub-second response time
– Continued investment in vectorization
– LLAP
– Using Apache HBase for the metastore
LLAP = Live Long And Process
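To make the 0.14-era numbers concrete, here is a minimal sketch of running the TPC-DS query 3 shape over JDBC against HiveServer2. The host, database name, and credentials are placeholder assumptions; the SET keys are standard Hive settings for Tez and vectorization, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StarJoinExample {
    public static void main(String[] args) throws Exception {
        // Unsecured placeholder HiveServer2 endpoint.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/tpcds");
             Statement stmt = conn.createStatement()) {
            // The performance features the slide names:
            stmt.execute("SET hive.execution.engine=tez");
            stmt.execute("SET hive.vectorized.execution.enabled=true");
            // TPC-DS query 3: star join + group + order + limit over ORC tables.
            ResultSet rs = stmt.executeQuery(
                "SELECT dt.d_year, item.i_brand_id, item.i_brand, "
              + "       SUM(ss_ext_sales_price) AS sum_agg "
              + "FROM store_sales "
              + "JOIN date_dim dt ON dt.d_date_sk = store_sales.ss_sold_date_sk "
              + "JOIN item ON store_sales.ss_item_sk = item.i_item_sk "
              + "WHERE item.i_manufact_id = 128 AND dt.d_moy = 11 "
              + "GROUP BY dt.d_year, item.i_brand, item.i_brand_id "
              + "ORDER BY dt.d_year, sum_agg DESC, i_brand_id "
              + "LIMIT 100");
            while (rs.next()) {
                System.out.printf("%d %d %s %.2f%n",
                    rs.getInt(1), rs.getInt(2), rs.getString(3), rs.getDouble(4));
            }
        }
    }
}
```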
5. LLAP: Why?
• It is hard to be fast and flexible in Tez
– When SQL session starts, Tez AM spun up (first-query cost)
– For subsequent queries Tez containers can be
– pre-allocated: fast but not flexible
– allocated and released for each query: flexible but start-up cost for every query
• No caching of data between queries
– Even if data is in the OS cache, much of the IO cost is deserialization/vector marshaling, which is not shared
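The pre-allocation trade-off above maps onto real session settings; a minimal sketch follows, assuming an unsecured HiveServer2 at a placeholder host, with an illustrative container count.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PrewarmSession {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // Pre-allocated: containers are spun up and held for the session.
            // Fast for every query after the first, but capacity is fixed.
            stmt.execute("SET hive.prewarm.enabled=true");
            stmt.execute("SET hive.prewarm.numcontainers=10");
            // With prewarm disabled, containers are requested per query:
            // flexible sizing, but each query pays YARN allocation latency.
        }
    }
}
```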
6. LLAP: What
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
– Multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins
– Regular Tez tasks used for larger shuffles and other operators
• LLAP has an in-memory columnar data cache
– High-throughput IO using an async IO elevator with a dedicated thread and core per disk
– Low latency by providing data from the in-memory (off-heap) cache instead of going to HDFS
– Stores data in columnar format for vectorization irrespective of underlying file type
– Security enforced across queries and users
• Uses YARN for resource management
[Diagram: a node running the LLAP daemon process, with a query fragment executing as a task against the LLAP in-memory columnar cache, backed by HDFS]
7. LLAP: What
[Diagram: the LLAP process runs on multiple nodes, accelerating Tez tasks; a Hive query fans out to LLAP daemons on each node, each running read tasks against its in-memory columnar cache over HDFS]
8. LLAP: Is and Is Not
• It is not MPP
– Data not shuffled between LLAP nodes (except in limited cases)
• It is not a replacement for Tez or Spark
– Configured engine still used to launch tasks for post-shuffle operations (e.g. hash joins, distributed aggregations, etc.)
• It is not required; users can still use Hive without installing LLAP daemons
• It is a map server, or a set of standing map tasks
• It is currently under development on the llap branch
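The "not required" point later became visible as session-level knobs once LLAP shipped with Hive 2.x. A hedged sketch follows; the configuration keys are from Hive 2.x (after this talk), and the host is a placeholder.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LlapModes {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // Read through the daemons' in-memory columnar cache.
            stmt.execute("SET hive.llap.io.enabled=true");
            // none = plain Tez containers (no LLAP daemons needed),
            // map  = scan-side "map" work runs in the daemons,
            // all  = push everything that fits into LLAP.
            stmt.execute("SET hive.llap.execution.mode=map");
        }
    }
}
```

The "map" mode matches the slide's framing of LLAP as a map server: scans, filters, and some joins run in the daemons, while post-shuffle work stays in ordinary Tez tasks.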
11. HBase Metastore: Why?
> 700 metastore queries to plan TPC-DS query 27!!!
12. HBase Metastore: Why?
• Object Relational Modeling is an impedance mismatch
• The need to work across different DBs limits tuning opportunities
• No caching of catalog objects or stats in HiveServer2 or the Hive metastore
• Hadoop nodes cannot contact the RDBMS directly due to scale issues
• Solution: use HBase
– Can store objects directly, no need to normalize
– Already scales, performs, etc.
– Can store additional data not stored today due to RDBMS capacity limitations
– Can access the metadata from the cluster (e.g. LLAP, Tez AM)
13. But...
• HBase does not have transactions – the metastore needs them
– Tephra, Omid 2 (Yahoo), and others working on this
• HBase is hard to administer and install
– Yes, we will need to improve this
– We will also need an embedded option for test/POC setups to keep HBase from becoming a barrier to adoption
• Basically any work we need to do to HBase for this is good, since it benefits all HBase users
14. HBase Metastore: How
• HBaseStore, a new implementation of RawStore that stores data in HBase
• Not the default; users are still free to use an RDBMS
• Less than 10 tables in HBase
– DBS, TBLS, PARTITIONS, ... – basically one for each object type
– Common partition data factored out to significantly reduce size
• Layout highly optimized for SELECT and DML queries; longer operations moved into DDL (e.g. grant)
• Extensive caching
– Of catalog objects for the length of a query
– Of aggregated stats across queries and users
• Ongoing work in the hbase-metastore branch
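A minimal sketch of the plug-point involved: the metastore's storage layer is selected by the standard hive.metastore.rawstore.impl setting, which the branch points at the HBase-backed implementation instead of the ORM-based default. The HBaseStore class name below is the one used on the hbase-metastore branch; treat the snippet as illustrative.

```java
import org.apache.hadoop.hive.conf.HiveConf;

public class MetastoreBackend {
    public static void main(String[] args) {
        HiveConf conf = new HiveConf();
        // Default is ObjectStore, the ORM layer over an RDBMS; the branch
        // swaps in an HBase-backed RawStore implementation.
        conf.set("hive.metastore.rawstore.impl",
                 "org.apache.hadoop.hive.metastore.hbase.HBaseStore");
        System.out.println(conf.get("hive.metastore.rawstore.impl"));
    }
}
```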
16. Apache Phoenix: Putting SQL Back in NoSQL
• SQL layer on top of HBase
• Originally oriented toward transaction processing
• Moving to add more analytics-type operators
– Adding multiple join implementations
– Requests for OLAP functions (PHOENIX-154)
• Working on adding transactions (PHOENIX-1674)
• Moving to Apache Calcite for optimization (PHOENIX-1488)
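For flavor, a small sketch of Phoenix's SQL-over-HBase model through plain JDBC, with Phoenix's UPSERT standing in for INSERT/UPDATE. The ZooKeeper host and the table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:phoenix:zk-host:2181")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS orders ("
              + "  order_id BIGINT NOT NULL PRIMARY KEY,"
              + "  customer VARCHAR, amount DECIMAL(10,2))");
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO orders VALUES (?, ?, ?)")) {
                ps.setLong(1, 1L);
                ps.setString(2, "acme");
                ps.setBigDecimal(3, new java.math.BigDecimal("99.50"));
                ps.executeUpdate();
            }
            conn.commit();  // Phoenix batches mutations until commit
            ResultSet rs = conn.createStatement().executeQuery(
                "SELECT customer, SUM(amount) FROM orders GROUP BY customer");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
        }
    }
}
```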
18. What If?
• We could share one O/JDBC driver?
• We could share one SQL dialect?
• Phoenix could leverage extensive analytics functionality in Hive without re-inventing it?
• Users could access their transactional and analytics data in single SQL operations?
19. How?
• Insight #1: LLAP is a storage plus operations server for Hive; we can swap it out for other implementations
• Insight #2: Tez and Spark can do post-shuffle operations (hash join, etc.) with LLAP or HBase
• Insight #3: Calcite (used by both Hive and Phoenix) is built specifically to integrate disparate data storage systems
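As a rough illustration of Insight #3: Calcite federates storage engines by letting each one register a Schema whose Tables it knows how to scan, and then plans queries across all registered schemas. The toy in-memory table below is purely illustrative; a real Hive/Phoenix integration would hand back HBase- or LLAP-backed Table implementations instead.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class ToyStoreSchema extends AbstractSchema {
  // One table named EVENTS; a real adapter would enumerate the engine's catalog.
  @Override protected Map<String, Table> getTableMap() {
    return Collections.singletonMap("EVENTS", (Table) new EventsTable());
  }

  static class EventsTable extends AbstractTable implements ScannableTable {
    @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
      return typeFactory.builder()
          .add("ID", SqlTypeName.BIGINT)
          .add("NAME", SqlTypeName.VARCHAR)
          .build();
    }
    // Full-scan entry point; this toy version pushes nothing down.
    @Override public Enumerable<Object[]> scan(DataContext root) {
      return Linq4j.asEnumerable(
          new Object[][] { {1L, "login"}, {2L, "click"} });
    }
  }
}
```

Registered on a Calcite connection via rootSchema.add("toy", new ToyStoreSchema()), the table becomes queryable as toy.EVENTS and can be joined against tables from any other registered schema, which is the integration mechanism the insight relies on.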
20. Vision
• User picks storage location for a table in create table (LLAP or HBase)
• Transactions more efficient in HBase tables but work in both
• Analytics more efficient in LLAP tables but work in both
• Queries that require shuffle use Tez or Spark for post-shuffle operators
[Diagram: queries enter a JDBC server; Calcite is used for planning and Phoenix for execution, with work routed to HBase and LLAP nodes over HDFS]
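A purely hypothetical sketch of what this vision could look like to a user: one driver, one dialect, storage chosen per table. Neither the URL scheme nor the STORED BY shorthand below is real syntax from this talk's timeframe; they exist only to illustrate the idea.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnifiedStoreVision {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:unified://server:9999/default");  // hypothetical driver
             Statement stmt = conn.createStatement()) {
            // Transactional table: lives in HBase, efficient for OLTP.
            stmt.execute("CREATE TABLE orders (id BIGINT, customer VARCHAR(64)) "
                + "STORED BY 'hbase'");  // hypothetical shorthand
            // Analytics table: lives in LLAP/ORC, efficient for scans.
            stmt.execute("CREATE TABLE sales_history (id BIGINT, amount DOUBLE) "
                + "STORED BY 'llap'");   // hypothetical shorthand
            // A single query joining both would use Tez or Spark for the
            // post-shuffle operators, per the slide above.
        }
    }
}
```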
21. Hurdles
• Need to integrate types/data representation
• Need to integrate transaction management
• Work to do in Calcite to optimize transactional queries well