This document provides an overview of Hive and its performance capabilities. It covers Hive's SQL interface for querying large datasets stored in Hadoop, its architecture for compiling SQL queries into MapReduce jobs, and its SQL feature coverage. It also describes techniques for optimizing Hive performance, including data abstractions such as partitions, buckets, and skewed tables, as well as Hive's join strategies (shuffle joins, broadcast joins, and sort-merge bucket joins) and how shuffle joins are implemented in MapReduce.
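The join strategies summarized above can be illustrated outside of Hadoop. Below is a minimal Python sketch with toy in-memory data (not Hive's actual distributed implementation): a shuffle join groups both tables by join key before matching rows, while a broadcast (map-side) join replicates the small table into a hash table that each mapper probes directly, avoiding the shuffle phase entirely.

```python
from collections import defaultdict

# Toy tables (hypothetical data): orders is the large side, customers the small side.
orders = [(1, "laptop"), (2, "phone"), (1, "mouse")]   # (customer_id, item)
customers = [(1, "alice"), (2, "bob")]                 # (customer_id, name)

def shuffle_join(left, right):
    """Shuffle join: rows from both inputs are grouped ("shuffled") by join
    key, then each key's rows are matched, as a MapReduce reducer would."""
    buckets = defaultdict(lambda: ([], []))
    for key, value in left:
        buckets[key][0].append(value)
    for key, value in right:
        buckets[key][1].append(value)
    return [(key, l, r)
            for key, (ls, rs) in buckets.items()
            for l in ls for r in rs]

def broadcast_join(big, small):
    """Broadcast (map-side) join: the small table is replicated into every
    mapper's memory as a hash table, so no shuffle phase is needed."""
    lookup = dict(small)
    return [(key, value, lookup[key]) for key, value in big if key in lookup]

# Both strategies produce the same rows; they differ only in data movement.
print(sorted(shuffle_join(orders, customers)) == sorted(broadcast_join(orders, customers)))  # True
```

In Hive the choice is made by the optimizer (or by hints): a broadcast join only pays off when one side is small enough to fit in each task's memory.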
Enterprises have been using both Big Data and Cloud Computing technologies for years. Until recently, the two have not been combined. Now the agility and efficiency benefits of self-service elastic infrastructure are being extended to Big Data initiatives – whether on-premises or in the public cloud.
This session at Hadoop Summit in San Jose, California (June 2016) discusses the emerging category of Big-Data-as-a-Service (BDaaS) - representing the intersection of Big Data and Cloud Computing.
In this session, Kris Applegate (Cloud and Big Data Solution Architect at Dell) and Thomas Phelan (Co-Founder and Chief Architect at BlueData) outlined the following:
- Innovations that paved the way for Big-Data-as-a-Service
- Definition and categories of Big-Data-as-a-Service
- Key considerations for Big-Data-as-a-Service in the enterprise, including public cloud or on-premises deployment options
A video replay can also be found here: https://youtu.be/_ucPoTKuj8Q
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation gives an overview of the performance improvements brought to Hive by the Stinger initiative.
Hadoop Infrastructure @ Uber: Past, Present and Future (DataWorks Summit)
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop is at the heart of the data infrastructure. This talk covers the journey of Hadoop at Uber and future plans for scaling to support billions of trips: how the cluster grew from 10 to 2,000 nodes, and how it will grow to tens of thousands of nodes. We discuss our mistakes, lessons, and wins; how we process billions of events per day; and the unique challenges and real-world use cases involved in co-locating Uber’s service architecture with batch workloads (e.g. data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some problems in ways not done before. This presentation offers these as examples and encourages the audience to enhance the ecosystem themselves, growing the community around these projects and benefiting the big data space as a whole. It is aimed at anyone working in big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes, and it also introduces some of the technologies the Uber team is building in the big data space.
In this webinar, we'll:
- Examine the key drivers and use cases for high availability, performance, and scalability for Apache Hadoop.
- Walk through an overview of a reference architecture for a Non-Stop Hadoop implementation.
- Show how you can get started with Non-Stop Hadoop on the Hortonworks Data Platform.
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes (DataWorks Summit)
Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling HDFS's storage management: the centralized scheme within the NameNode becomes the main bottleneck, limiting the total number of files that can be stored. Although a typical large HDFS cluster can store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. Storage management is moved to a distributed scheme, and a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
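The scale of the NameNode bottleneck described above is easy to estimate. The sketch below uses the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); the exact figure varies by version and configuration, so treat this as illustrative only.

```python
def namenode_heap_gb(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough estimate of NameNode heap consumed by namespace metadata.
    Each file and each block is tracked as an in-memory object; ~150 bytes
    per object is a commonly cited rule of thumb, not an exact figure."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object / 1024**3

# A billion small files (one block each) need hundreds of GB of heap
# just for metadata, regardless of how little data they hold:
print(round(namenode_heap_gb(1_000_000_000), 1))  # 279.4
```

This is why moving block tracking out of the NameNode and into distributed storage containers, as the talk proposes, is what unlocks billions of files.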
Deep learning has become widespread as frameworks such as TensorFlow and PyTorch have made it easy to onboard machine learning applications. However, while it is easy to start developing with these frameworks on your local developer machine, scaling up a model to run on a cluster and train on huge datasets is still challenging. Code and dependencies have to be copied to every machine and defining the cluster configurations is tedious and error-prone. In addition, troubleshooting errors and aggregating logs is difficult. Ad-hoc solutions also lack resource guarantees, isolation from other jobs, and fault tolerance.
To solve these problems and make scaling deep learning easy, we have made several enhancements to Hadoop and built an open-source deep learning platform called TonY. In this talk, Anthony and Keqiu will discuss new Hadoop features useful for deep learning, such as GPU resource support, and deep dive into TonY, which lets you run deep learning programs natively on Hadoop. We will discuss TonY's architecture and how it allows users to manage their deep learning jobs, acting as a portal from which to launch notebooks, monitor jobs, and visualize training results.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
- views of the Big Data ecosystem and its components
- an example of a Hadoop cluster
- considerations when selecting a Hadoop distribution
- some of the Hadoop distributions available
- a recommended Hadoop distribution
Real Time Interactive Queries in Hadoop: Big Data Warehousing Meetup (Caserta)
During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence functionality on Hadoop. Hortonworks also provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
Tez is the next-generation Hadoop query processing framework written on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing, for the benefit of the entire Hadoop query ecosystem.
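The graph dataflow model above can be pictured as a directed acyclic graph of stages, scheduled as a single job in dependency order rather than as a chain of separate MapReduce jobs that materialize intermediate results to HDFS. A hypothetical sketch (stage names invented for illustration):

```python
# A hypothetical four-stage query plan expressed as a DAG: each stage lists
# the stages it depends on. Stage names are invented for illustration.
stages = {
    "scan_a": [],
    "scan_b": [],
    "join": ["scan_a", "scan_b"],
    "aggregate": ["join"],
}

def topo_order(graph):
    """Return stages in an order where every stage runs after its inputs."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph[node]:
            visit(dep)
        order.append(node)
    for node in graph:
        visit(node)
    return order

print(topo_order(stages))  # ['scan_a', 'scan_b', 'join', 'aggregate']
```

Because the engine sees the whole graph at once, it can stream data between stages and skip the per-job startup and HDFS round-trips that a chain of MapReduce jobs would incur.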
Sept 17 2013 - THUG - HBase: a Technical Introduction (Adam Muise)
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
Ambry is an open source object store responsible for storing all media content at LinkedIn. This talk covers the development of Ambry at LinkedIn and its architecture in some detail.
Soft-Shake 2013: Enabling Realtime Queries to End Users (Benoit Perroud)
Since it became an Apache top-level project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework has two major drawbacks: query latency and data freshness.
At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style – from simple key/value lookups to complex queries.
Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.
There is a lot of traction today in this area, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets. After discussing the evolution of common Hadoop platform deployments, a hybrid approach called the lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact together.
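The lambda architecture mentioned above combines a slow-but-complete batch layer with a fast-but-incremental speed layer, and a serving layer that merges the two at query time. A toy Python sketch of that merge (hypothetical event data; real systems would use Hadoop for the batch views and a stream processor for the realtime views):

```python
from collections import Counter

# Events already processed by the (slow, complete) batch layer, and events
# that have arrived since the last batch run, seen only by the speed layer.
historical = ["click", "click", "view"]
recent = ["view", "click"]

batch_view = Counter(historical)     # recomputed periodically over all data
realtime_view = Counter(recent)      # updated incrementally, low latency

def query(event):
    """The serving layer answers queries by merging both views."""
    return batch_view[event] + realtime_view[event]

print(query("click"), query("view"))  # 3 2
```

Each batch run absorbs the recent events into the batch view and resets the realtime view, so queries stay both fresh and eventually exact.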
Keeping your Enterprise’s Big Data Secure by Owen O’Malley at Big Data Spain ... (Big Data Spain)
Security is a tradeoff between usability and safety and should be driven by the perceived threats.
https://www.bigdataspain.org/2017/talk/keeping-enterprises-big-data-secure
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Add Redis to Postgres to Make Your Microservices Go Boom! (Dave Nielsen)
Slides for talk delivered at PostgresOpen 2018 in San Francisco https://postgresql.us/events/pgopen2018/schedule/session/538-add-redis-to-postgres-to-make-your-microservice-go-boom/
Hortonworks and Red Hat Webinar - Part 2 (Hortonworks)
Learn more about creating reference architectures that optimize the delivery of the Hortonworks Data Platform. You will hear more about Hive and JBoss Data Virtualization security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Apache Hive is a rapidly evolving project beloved by many in the big data ecosystem. Hive continues to expand its support for analytics, reporting, and interactive queries, and the community is striving to improve these along with many other aspects and use cases. This talk introduces the latest and greatest features and optimizations to appear in the project over the last year, including benchmarks covering LLAP, materialized views, Apache Druid integration, workload management, ACID improvements, running Hive in the cloud, and performance improvements. It closes with a brief look at what to expect in the future.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Liberate Your Files with a Private Cloud Storage Solution powered by Open Source (Isaac Christoffersen)
Many of today's enterprises are working under a false assumption that there is a trade-off between consumer-centric file sharing and corporate IT policy compliance. This is because most market-leading SaaS solutions for file sync and share are not designed around enterprise IT's needs. They represent growing risks with vendor lock-in, data security, compliance and data ownership.
With a track record of delivering innovative open source solutions, Vizuri has an answer to help enterprises overcome these hurdles. By leveraging innovative Red Hat and ownCloud open source solutions, this solution helps corporate IT provide a simple-to-use file sync and share service for employees. As a result, organizations are able to retain greater control over valuable intellectual property.
How is it that one system can query terabytes of data, yet still provide interactive query support? This talk discusses two of the underlying technologies that allow Apache Hive to support fast query response, both on-premises in HDFS and in cloud object stores such as S3 and WASB.
LLAP was introduced in Hive 2.0. It provides standing processes that securely cache Hive’s columnar data and can do query processing without ever needing to start tasks in Hadoop. We will cover LLAP’s architecture, intended use cases, and performance numbers both on-premises and in the cloud.
The second technology is the integration of Hive with Apache Druid. Druid excels at low-latency, interactive queries over streaming data. Its method of storing data makes it very well suited to OLAP-style queries. We will cover how Hive can be integrated with Druid to support real-time streaming of data from Kafka and OLAP queries.
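LLAP's cache is far more sophisticated than anything shown here (off-heap, columnar, with asynchronous I/O), but the core idea of a long-lived daemon keeping hot data resident instead of re-reading it per task can be sketched with a tiny LRU cache (all names hypothetical):

```python
from collections import OrderedDict

class ColumnChunkCache:
    """Tiny LRU cache standing in for a long-lived daemon's data cache:
    hot column chunks stay resident in memory, cold ones are evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()
        self.hits = self.misses = 0

    def get(self, chunk_id, load):
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)    # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            self.chunks[chunk_id] = load(chunk_id)
            if len(self.chunks) > self.capacity:
                self.chunks.popitem(last=False)  # evict least recently used
        return self.chunks[chunk_id]

cache = ColumnChunkCache(capacity=2)
load = lambda chunk_id: f"columnar-data:{chunk_id}"
for chunk in ["a", "b", "a", "c"]:
    cache.get(chunk, load)
print(cache.hits, cache.misses)  # 1 3
```

Because the process (like an LLAP daemon) outlives any single query, repeated queries over the same hot columns hit memory instead of HDFS or S3, which is where the interactive latencies come from.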
In this talk we review what Docker is and why it’s important to developers, admins, and DevOps teams using a NoSQL database such as Aerospike, the high-performance NoSQL database. Persistence is a critical element of a successful multi-container strategy. We also cover: using Docker to orchestrate a multi-container application (Flask + Aerospike); injecting HAProxy and other production requirements as we deploy to production; and scaling the web and Aerospike clusters to meet demand. This presentation, led by Alvin Richards, VP of Product at Aerospike, includes an interactive demo showcasing the core Docker components (Machine, Engine, Swarm, and Compose) along with Aerospike’s integration. We hope you will see how much simpler Docker can make building and deploying multi-node Aerospike-based applications.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
2013 July 23 Toronto Hadoop User Group Hive Tuning
1. Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
Hive & Performance
Toronto Hadoop User Group
July 23 2013
Presenter:
Adam Muise – Hortonworks
amuise@hortonworks.com
2. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
3. Hive – SQL Analytics For Any Data Size
Store and query all data in Hive: sensor, mobile, weblog, operational / MPP.
Use existing SQL tools and existing SQL processes.
4. Hive’s Focus
• Scalable SQL processing over data in Hadoop
• Scales to 100PB+
• Structured and Unstructured data
5. Comparing Hive with RDBMS

Hive | RDBMS
SQL interface. | SQL interface.
Focus on analytics. | May focus on online or analytics.
No transactions. | Transactions usually supported.
Partition adds, no random INSERTs. In-place updates not natively supported (but are possible). | Random INSERT and UPDATE supported.
Distributed processing via map/reduce. | Distributed processing varies by vendor (if available).
Scales to hundreds of nodes. | Seldom scales beyond 20 nodes.
Built for commodity hardware. | Often built on proprietary hardware (especially when scaling out).
Low cost per petabyte. | What’s a petabyte?
6. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
7. Hive: The SQL Interface to Hadoop
[Architecture diagram: Web UI, JDBC/ODBC and CLI clients connect to HiveServer2; Hive’s Compiler, Optimizer and Executor translate SQL to Map/Reduce, which runs on Hadoop’s JobTracker/NameNode and TaskTrackers/DataNodes.]
1. User issues SQL query
2. Hive parses and plans query
3. Query converted to Map/Reduce
4. Map/Reduce run by Hadoop
8. Hive: Reliable SQL Processing at Scale
[Diagram: a SQL job running as Map/Reduce across four nodes, with progress stored in HDFS.]
Time 1: Job = 50% complete.
Time 2: Node 3 fails; job = 85% complete.
Time 3: Job moves to Node 4; job = 100% complete.
9. SQL Coverage: SQL 92 with Extensions

SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; BINARY; TIMESTAMP; ARRAY, MAP, STRUCT, UNION; DECIMAL; CHAR; VARCHAR; DATE.

SQL Semantics: SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; CLUSTER BY, DISTRIBUTE BY; sub-queries in FROM clause; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); sub-queries for IN/NOT IN, HAVING; EXISTS / NOT EXISTS; INTERSECT, EXCEPT.

Legend (color-coded on the original slide): Available, Roadmap, New in Hive 0.11.
10. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
11. Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access.
Hierarchy: Database → Table → Partition → Bucket.
Optional per table: skewed keys split from unskewed keys.
12. “I heard you should avoid joins…”
• “Joins are evil” – Cal Henderson
– Joins should be avoided in online systems.
• Joins are unavoidable in analytics.
– Making joins fast is the key design point.
13. Quick Refresher on Joins

customer:
first | last | id
Nick | Toner | 11911
Jessie | Simonds | 11912
Kasi | Lamers | 11913
Rodger | Clayton | 11914
Verona | Hollen | 11915

order:
cid | price | quantity
4150 | 10.50 | 3
11914 | 12.25 | 27
3491 | 5.99 | 5
2934 | 39.99 | 22
11914 | 40.50 | 10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Joins match values from one table against values in another table.
14. Hive Join Strategies

Type | Approach | Pros | Cons
Shuffle Join | Join keys are shuffled using map/reduce and the join is performed on the reduce side. | Works regardless of data size or layout. | Most resource-intensive and slowest join type.
Broadcast Join | Small tables are loaded into memory on all nodes; the mapper scans through the large table and joins. | Very fast; single scan through largest table. | All but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join | Mappers take advantage of co-location of keys to do efficient joins. | Very fast for tables of any size. | Data must be sorted and bucketed ahead of time.
15. Shuffle Joins in Map Reduce

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Each mapper emits records tagged with the join key (same customer and order tables as before):
Customer map output: { id: 11911, { first: Nick, last: Toner }}, { id: 11914, { first: Rodger, last: Clayton }}, …
Order map output: { cid: 4150, { price: 10.50, quantity: 3 }}, { cid: 11914, { price: 12.25, quantity: 27 }}, …
Reducer 1 receives: { id: 11914, { first: Rodger, last: Clayton }} and { cid: 11914, { price: 12.25, quantity: 27 }}
Reducer 2 receives: { id: 11911, { first: Nick, last: Toner }} and { cid: 4150, { price: 10.50, quantity: 3 }}, …

Identical keys are shuffled to the same reducer. The join is done reduce-side. Expensive from a network utilization standpoint.
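The mechanics above can be sketched in plain Python. This is an illustrative model of the map, shuffle and reduce phases, not Hive’s actual implementation; the sample rows mirror the example tables.

```python
from collections import defaultdict

def shuffle_join(customers, orders):
    """Model of a reduce-side (shuffle) join: mappers tag each record
    with its join key, the shuffle groups identical keys onto one
    reducer, and the reducer emits the cross product per key."""
    shuffled = defaultdict(lambda: ([], []))
    for cust in customers:                 # map phase, customer table
        shuffled[cust["id"]][0].append(cust)
    for order in orders:                   # map phase, order table
        shuffled[order["cid"]][1].append(order)
    results = []
    for _key, (custs, ords) in shuffled.items():   # reduce phase
        for c in custs:
            for o in ords:
                results.append({**c, **o})
    return results

customers = [{"id": 11911, "first": "Nick", "last": "Toner"},
             {"id": 11914, "first": "Rodger", "last": "Clayton"}]
orders = [{"cid": 4150, "price": 10.50, "quantity": 3},
          {"cid": 11914, "price": 12.25, "quantity": 27},
          {"cid": 11914, "price": 40.50, "quantity": 10}]
joined = shuffle_join(customers, orders)
# Rodger Clayton (id 11914) matches two orders; Nick matches none.
```

Every record crosses the network to reach its reducer, which is why the slide calls this the most expensive strategy.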
16. Broadcast Join
• Star schemas use dimension tables small enough to fit in RAM.
• Small tables held in memory by all nodes.
• Single pass through the large table.
• Used for star-schema type joins common in DW.
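The idea can be modeled in a few lines of Python, a hypothetical sketch of what each mapper does once the small table has been broadcast (not Hive’s real map-join code; the sample data is invented for illustration):

```python
def broadcast_join(small_table, large_table, small_key, large_key):
    """Build a hash map from the small (dimension) table, then make a
    single pass over the large (fact) table, probing the map."""
    lookup = {row[small_key]: row for row in small_table}  # fits in RAM
    for fact in large_table:           # single sequential scan
        dim = lookup.get(fact[large_key])
        if dim is not None:
            yield {**dim, **fact}

dims = [{"id": 1, "country": "us"}, {"id": 2, "country": "ca"}]
facts = [{"clientsk": 1, "path": "/a"}, {"clientsk": 3, "path": "/b"},
         {"clientsk": 2, "path": "/c"}]
rows = list(broadcast_join(dims, facts, "id", "clientsk"))
```

No shuffle is needed: the only data movement is copying the small table to every node once.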
17. When both are too large for memory:

(Same customer and order tables, with the usual join:)
SELECT * FROM customer JOIN order ON customer.id = order.cid;

CREATE TABLE customer (id int, first string, last string)
CLUSTERED BY(id) SORTED BY(id) INTO 32 BUCKETS;

CREATE TABLE order (cid int, price float, quantity int)
CLUSTERED BY(cid) SORTED BY(cid) INTO 32 BUCKETS;

Cluster and sort by the most common join key.
18. Hive’s Clustering and Sorting

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Observation 1: Sorting by the join key makes joins easy. All possible matches reside in the same area on disk.
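Observation 1 can be made concrete with a sort-merge join sketch in Python. This is an illustrative model (assuming unique keys on the customer side), not Hive’s implementation:

```python
def sort_merge_join(left, right, lkey, rkey):
    """Merge-join two lists already sorted on their join keys.
    Advances two cursors in lockstep; assumes unique keys in `left`,
    allows duplicates in `right`. No hashing, no shuffle."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][lkey], right[j][rkey]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:                # keys match: emit, keep scanning the right side
            out.append({**left[i], **right[j]})
            j += 1
    return out

customers = sorted([{"id": 11911, "first": "Nick"},
                    {"id": 11914, "first": "Rodger"}], key=lambda r: r["id"])
orders = sorted([{"cid": 4150, "price": 10.50},
                 {"cid": 11914, "price": 12.25},
                 {"cid": 11914, "price": 40.50}], key=lambda r: r["cid"])
rows = sort_merge_join(customers, orders, "id", "cid")
```

Because both inputs are sorted, each list is read exactly once, front to back, which is why pre-sorted data makes the join so cheap.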
19. Hive’s Clustering and Sorting

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Observation 2: Hash bucketing a join key ensures all matching values reside on the same node. Equi-joins can then run with no shuffle.
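Observation 2 can be modeled numerically: hash both tables’ join keys into the same number of buckets and equal keys always land in the matching bucket pair, so each pair can be joined locally. A toy Python sketch (Hive uses its own hash function, not `%`):

```python
from collections import defaultdict

NUM_BUCKETS = 4

def bucketize(rows, key):
    """Hash-partition rows into NUM_BUCKETS bucket files, as
    CLUSTERED BY does at write time."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r[key] % NUM_BUCKETS].append(r)   # toy hash function
    return buckets

cust_buckets = bucketize([{"id": i} for i in [11911, 11912, 11913, 11914]], "id")
order_buckets = bucketize([{"cid": c} for c in [11914, 11914, 4150]], "cid")

# Bucket n of customer only ever needs bucket n of order: a local join.
matches = [(c["id"], o["cid"])
           for n in range(NUM_BUCKETS)
           for c in cust_buckets.get(n, [])
           for o in order_buckets.get(n, [])
           if c["id"] == o["cid"]]
```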
20. Controlling Data Locality with Hive
• Bucketing:
– Hash partition values into a configurable number of buckets.
– Usually coupled with sorting.
• Skews:
– Split values out into separate files.
– Used when certain values are frequently seen.
• Replication Factor:
– Increase replication factor to accelerate reads.
– Controlled at the HDFS layer.
• Sorting:
– Sort the values within given columns.
– Greatly accelerates queries when used with ORCFile filter pushdown.
21. Guidelines for Architecting Hive Data

Table Size | Data Profile | Query Pattern | Recommendation
Small | Hot data | Any | Increase replication factor
Any | Any | Very precise filters | Sort on the column most frequently used in precise queries
Large | Any | Joined to another large table | Sort and bucket both tables along the join key
Large | One value >25% of count within a high-cardinality column | Any | Split the frequent value into a separate skew
Large | Any | Queries tend to have a natural boundary such as date | Partition the data along the natural boundary
22. Hive Persistence Formats
• Built-in Formats:
– ORCFile
– RCFile
– Avro
– Delimited Text
– Regular Expression
– S3 Logfile
– Typed Bytes
• 3rd-Party Addons:
– JSON
– XML
23. Hive allows mixed formats.
• Use Case:
– Ingest data in a write-optimized format like JSON or delimited.
– Every night, run a batch job to convert to read-optimized ORCFile.
24. ORCFile – Efficient Columnar Layout
Large block size well suited for HDFS. Columnar format arranges columns adjacent within the file for compression and fast access.
25. ORCFile Advantages
• High Compression
– Many tricks used out-of-the-box to ensure high compression rates.
– RLE, dictionary encoding, etc.
• High Performance
– Inline indexes record value ranges within blocks of ORCFile data.
– Filter pushdown allows efficient scanning during precise queries.
• Flexible Data Model
– All Hive types including maps, structs and unions.
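The inline-index idea can be modeled simply: each group of rows records the min and max of a column, and a precise query skips any group whose range cannot contain the target. A toy Python sketch, not ORC’s real reader (the real default stride is 10,000 rows):

```python
def build_row_groups(values, stride):
    """Split a column into row groups and record min/max per group,
    like ORCFile's row index entries."""
    groups = []
    for i in range(0, len(values), stride):
        chunk = values[i:i + stride]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equals(groups, target):
    """Filter pushdown: only read groups whose min/max range admits target."""
    read, hits = 0, []
    for g in groups:
        if g["min"] <= target <= g["max"]:
            read += 1
            hits.extend(v for v in g["rows"] if v == target)
    return read, hits

# Sorted data makes the ranges narrow and skipping effective.
productsk = sorted([1192, 4671, 7224, 9354, 10739, 16775])
groups = build_row_groups(productsk, stride=2)
read, hits = scan_equals(groups, 9354)
```

Only one of the three row groups is actually read; the others are pruned from their index entries alone.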
26. High Compression with ORCFile
[Chart comparing on-disk sizes across file formats; not reproduced in this transcript.]
27. Some ORCFile Samples

sale:
id | timestamp | productsk | storesk | amount | state
10000 | 2013-06-13T09:03:05 | 16775 | 670 | $70.50 | CA
10001 | 2013-06-13T09:03:05 | 10739 | 359 | $52.99 | IL
10002 | 2013-06-13T09:03:06 | 4671 | 606 | $67.12 | MA
10003 | 2013-06-13T09:03:08 | 7224 | 174 | $96.85 | CA
10004 | 2013-06-13T09:03:12 | 9354 | 123 | $67.76 | CA
10005 | 2013-06-13T09:03:18 | 1192 | 497 | $25.73 | IL

CREATE TABLE sale (
  id int,
  timestamp timestamp,
  productsk int,
  storesk int,
  amount decimal,
  state string
) STORED AS orc;
28. ORCFile Options and Defaults

Key | Default | Notes
orc.compress | ZLIB | High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 (= 256 KiB) | Number of bytes in each compression chunk
orc.stripe.size | 268,435,456 (= 256 MiB) | Number of bytes in each stripe
orc.row.index.stride | 10,000 | Number of rows between index entries (must be >= 1,000)
orc.create.index | true | Whether to create row indexes
29. No Compression: Faster but Larger

(Same sale table as before.)

CREATE TABLE sale (
  id int,
  timestamp timestamp,
  productsk int,
  storesk int,
  amount decimal,
  state string
) STORED AS orc tblproperties ("orc.compress"="NONE");
30. Column Sorting to Facilitate Skipping

sale, sorted by productsk:
id | timestamp | productsk | storesk | amount | state
10005 | 2013-06-13T09:03:18 | 1192 | 497 | $25.73 | IL
10002 | 2013-06-13T09:03:06 | 4671 | 606 | $67.12 | MA
10003 | 2013-06-13T09:03:08 | 7224 | 174 | $96.85 | CA
10004 | 2013-06-13T09:03:12 | 9354 | 123 | $67.76 | CA
10001 | 2013-06-13T09:03:05 | 10739 | 359 | $52.99 | IL
10000 | 2013-06-13T09:03:05 | 16775 | 670 | $70.50 | CA

CREATE TABLE sale (
  id int,
  timestamp timestamp,
  productsk int,
  storesk int,
  amount decimal,
  state string
) STORED AS orc;

INSERT INTO sale AS SELECT * FROM staging SORT BY productsk;

ORCFile skipping speeds queries like WHERE productsk = X, productsk IN (Y, Z), etc.
31. Not Your Traditional Database
• Traditional solution to all RDBMS problems:
– Put an index on it!
• Doing this in Hadoop = #fail
[Photo caption: Your Oracle DBA]
32. Going Fast in Hadoop
• Hadoop:
– Really good at coordinated sequential scans.
– No random I/O. A traditional index is pretty much useless.
• Keys to speed in Hadoop:
– Sorting and skipping take the place of indexing.
– Minimizing data shuffle is the other key consideration.
• Skipping data:
– Divide data among different files which can be pruned out.
– Partitions, buckets and skews.
– Skip records during scans using small embedded indexes.
– Automatic when you use ORCFile format.
– Sort data ahead of time.
– Simplifies joins and skipping becomes more effective.
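Partition-based file skipping can be modeled in Python: because each partition value is a separate directory, a filter on the partition column prunes whole directories before any data is read. A toy sketch, not Hive’s planner (the sample rows are invented):

```python
# Partitions keyed by their partition-column value; each value maps to
# the files (here, row lists) stored under that directory in HDFS.
partitions = {
    "201210": [{"productid": 1, "amount": 10.0}],
    "201211": [{"productid": 2, "amount": 20.0}],
    "201301": [{"productid": 3, "amount": 30.0}],
}

def scan_with_pruning(partitions, lo, hi):
    """Only open partitions inside [lo, hi]; others are never read.
    The pruning decision uses metadata alone, no data I/O."""
    scanned_dirs, rows = 0, []
    for part_dt, files in partitions.items():
        if lo <= part_dt <= hi:
            scanned_dirs += 1
            rows.extend(files)
    return scanned_dirs, rows

scanned, rows = scan_with_pruning(partitions, "201210", "201212")
```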
33. Data Layout Considerations for Fast Hive

Skip Reads: Partition tables and/or use skew; sort secondary columns when using ORCFile.
Minimize Shuffle: Sort and bucket on common join keys; use broadcast joins when joining small tables.
Reduce Latency: Increase replication factor for hot data; enable short-circuit read; take advantage of Tez + Tez Service (future).
34. Partitioning and Virtual Columns
• Partitioning makes queries go fast.
• You will almost always use some sort of partitioning.
• When partitioning you will use 1 or more virtual
columns.
• Virtual columns cause directories to be created in
HDFS.
– Files for that partition are stored within that subdirectory.
# Notice how xdate and state are not “real” column names.
CREATE TABLE sale (
  id int,
  amount decimal,
  ...
) partitioned by (xdate string, state string);
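The directory behavior can be sketched in Python, a hypothetical model of how a row’s virtual-column values map to an HDFS path (the helper name and warehouse path are illustrative, not Hive internals):

```python
def partition_path(table_root, partition_cols, row):
    """Each virtual column becomes one 'name=value' directory level;
    files for the partition live inside the resulting subdirectory."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table_root] + parts)

row = {"id": 1, "amount": 9.99, "xdate": "2013-03-01", "state": "CA"}
path = partition_path("/apps/hive/warehouse/sale", ["xdate", "state"], row)
# e.g. /apps/hive/warehouse/sale/xdate=2013-03-01/state=CA
```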
35. Loading Data with Virtual Columns
• By default at least one virtual column must be hard-coded.
• You can load all partitions in one shot:
– set hive.exec.dynamic.partition.mode=nonstrict;
– Warning: You can easily overwhelm your cluster this way.
INSERT INTO sale (xdate='2013-03-01', state='CA')
SELECT * FROM staging_table
WHERE xdate = '2013-03-01' AND state = 'CA';

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO sale (xdate, state)
SELECT * FROM staging_table;
36. You May Need to Re-Order Columns
• Virtual columns must be last within the inserted data set.
• You can use the SELECT statement to re-order.
INSERT INTO sale (xdate, state='CA')
SELECT id, amount, other_stuff, xdate, state
FROM staging_table
WHERE state = 'CA';
37. The Most Essential Hive Query Tunings
38. Tune Split Size – Always
• mapred.max.split.size and mapred.min.split.size
• Hive processes data in chunks subject to these bounds.
• min too large -> too few mappers.
• max too small -> too many mappers.
• Tune variables until mappers occupy:
  – All map slots if you own the cluster.
  – A reasonable number of map slots if you don’t.
• Example:
  – set mapred.max.split.size=100000000;
  – set mapred.min.split.size=1000000;
• Manual today, automatic in a future version of Hive.
• You will need to set these for most queries.
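The effect of the two bounds can be seen with simple arithmetic. This is a rough model (the real split computation also considers HDFS block boundaries), with an invented input size:

```python
def estimate_mappers(input_bytes, min_split, max_split,
                     block_size=128 * 1024 * 1024):
    """Rough model: the effective split size is the block size clamped
    into [min_split, max_split]; one mapper runs per split."""
    split = max(min_split, min(max_split, block_size))
    return -(-input_bytes // split)   # ceiling division

one_tb = 1024 ** 4
# max too small -> too many mappers:
many = estimate_mappers(one_tb, min_split=1_000_000, max_split=10_000_000)
# bounds from the slide's example:
fewer = estimate_mappers(one_tb, min_split=1_000_000, max_split=100_000_000)
```

Raising max.split.size by 10x here cuts the mapper count by roughly 10x, which is the lever the slide describes.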
39. Tune io.sort.mb – Sometimes
• Hive and Map/Reduce maintain some separate buffers.
• If Hive maps need lots of local memory you may need to shrink the map/reduce buffers.
• If your maps spill, try it out.
• Example:
  – set io.sort.mb=100;
40. Other Settings You Need
• All the time:
  – set hive.optimize.mapjoin.mapreduce=true;
  – set hive.optimize.bucketmapjoin=true;
  – set hive.optimize.bucketmapjoin.sortedmerge=true;
  – set hive.auto.convert.join=true;
  – set hive.auto.convert.sortmerge.join=true;
  – set hive.auto.convert.sortmerge.join.noconditionaltask=true;
• When bucketing data:
  – set hive.enforce.bucketing=true;
  – set hive.enforce.sorting=true;
• These and more are set by default in HDP 1.3.
  – Check for them in hive-site.xml.
  – If not present, set them in your query script.
41. Check Your Settings
• In Hive shell:
– See all settings with “set;”
– See one setting with “set <name>;”
– Change a setting with “set <name>=<value>;”
42. Example Workflow: Create a Staging Table
CREATE EXTERNAL TABLE pos_staging (
  txnid STRING,
  txntime STRING,
  givenname STRING,
  lastname STRING,
  postalcode STRING,
  storeid STRING,
  ind1 STRING,
  productid STRING,
  purchaseamount FLOAT,
  creditcard STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/hdfs/staging_data/pos_staging';
The raw data is the result of an initial load or the output of a MapReduce or Pig job. We create an external table over the results of that job since we only intend to use it to load an optimized table.
43. Example Workflow: Choose a Partition Scheme
hive> select distinct concat(year(txntime),month(txntime)) as part_dt
      from pos_staging;
…
OK
20121
201210
201211
201212
20122
20123
20124
20125
20126
20127
20128
20129
Time taken: 21.823 seconds, Fetched: 12 row(s)
Execute a query to determine if the partition choice returns a reasonable result. We will
use this projection to create partitions for our data set. You want to keep your partitions
large enough to be useful in partition pruning and efficient for HDFS storage. Hive has
configurable bounds to ensure you do not exceed per node and total partition counts
(defaults shown):
hive.exec.max.dynamic.partitions=1000
hive.exec.max.dynamic.partitions.pernode=100
44. Example Workflow: Define Optimized Table
CREATE TABLE fact_pos
(
  txnid STRING,
  txntime STRING,
  givenname STRING,
  lastname STRING,
  postalcode STRING,
  storeid STRING,
  ind1 STRING,
  productid STRING,
  purchaseamount FLOAT,
  creditcard STRING
) PARTITIONED BY (part_dt STRING)
CLUSTERED BY (txnid)
SORTED BY (txnid)
INTO 24 BUCKETS
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
The part_dt field is defined in the PARTITIONED BY clause and cannot share a name with any other field. In this case, we derive the partition key from a modification of txntime. The CLUSTERED and SORTED clauses contain the only key we intend to join the table on. We store as ORCFile with Snappy compression.
45. Example Workflow: Load Data Into Optimized Table
set hive.enforce.sorting=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.reduce.input.limit=-1;

FROM pos_staging
INSERT OVERWRITE TABLE fact_pos
PARTITION (part_dt)
SELECT
  txnid,
  txntime,
  givenname,
  lastname,
  postalcode,
  storeid,
  ind1,
  productid,
  purchaseamount,
  creditcard,
  concat(year(txntime),month(txntime)) as part_dt
SORT BY productid;
We use this command to load data from our staging table into our optimized ORCFile table. Note that we are using dynamic partitioning with the projection of the txntime field. This results in a MapReduce job that copies the staging data into an ORCFile-format, Hive-managed table.
46. Example Workflow: Increase Replication Factor
hadoop fs -setrep -R -w 5 /apps/hive/warehouse/fact_pos
Increase the replication factor for the high performance table.
This increases the chance for data locality. In this case, the
increase in replication factor is not for additional resiliency.
This is a trade-off of storage for performance.
In fact, to conserve space, you may choose to reduce the
replication factor for older data sets or even delete them
altogether. With the raw data in place and untouched, you can
always recreate the ORCFile high performance tables. Most
users place the steps in this example workflow into an Oozie
job to automate the work.
47. Example Workflow: Enabling Short-Circuit Read
In hdfs-site.xml (or your custom Ambari settings for HDFS; restart the service after):

dfs.block.local-path-access.user=hdfs
dfs.client.read.shortcircuit=true
dfs.client.read.shortcircuit.skip.checksum=false
Short-circuit reads allow the mappers to bypass the overhead of opening a port to the datanode when the data is local. The permissions on the local block files need to permit hdfs to read them (they should by default). See HDFS-2246 for more details.
48. Example Workflow: Execute Your Query
set hive.mapred.reduce.tasks.speculative.execution=false;
set io.sort.mb=300;
set mapreduce.reduce.input.limit=-1;

select productid, ROUND(SUM(purchaseamount),2) as total
from fact_pos
where part_dt between '201210' and '201212'
group by productid
order by total desc
limit 100;

…
OK
20535  3026.87
39079  2959.69
28970  2869.87
45594  2821.15
…
15649  2242.05
47704  2241.22
8140   2238.61
Time taken: 40.087 seconds, Fetched: 100 row(s)
In the case above, a simple query is executed to test out our table, with some example parameters set beforehand. The good news is that most of the parameters for join and engine optimizations are already set for you in Hive 0.11 (HDP). io.sort.mb is shown as an example of a tunable parameter you may want to change for this particular SQL (this value assumes 2-3 GB JVMs for mappers). We are also partition pruning for the holiday shopping season, Oct to Dec.
49. Example Workflow: Check Execution Path in Ambari
You can check the execution path in Ambari’s Job viewer. This gives a high level overview of
the stages and particular number of map and reduce tasks. With Tez, it also shows task
number and execution order. The counts here are small as this is a sample from a single-node
HDP Sandbox. For more detailed analysis, you will need to read the query plan via “explain”.
50. Learn To Read The Query Plan
• “explain extended” in front of your query.
• Sections:
– Abstract syntax tree – you can usually ignore this.
– Stage dependencies – dependencies and # of stages.
– Stage plans – important info on how Hive is running the job.
51. A sample query plan.
[Screenshot of a query plan showing 4 stages and the stage details.]
52. Use Case: Star Schema Join

request (fact table), client (dimension), path (dimension).

# Most popular URLs in US.
select path, country, count(*) as cnt
from request
join client on request.clientsk = client.id
join path on request.pathsk = path.id
where client.country = 'us'
group by path, country
order by cnt desc
limit 5;

Ideal = single scan of the fact table.
53. Case 1: hive.auto.convert.join=false;
• Generates 5 stages.

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-4 depends on stages: Stage-3
  Stage-0 is a root stage
54. Case 1: hive.auto.convert.join=false;
• 4 MapReduce jobs tell you something is not right.

Stage: Stage-1  Map Reduce
Stage: Stage-2  Map Reduce
Stage: Stage-3  Map Reduce
Stage: Stage-4  Map Reduce
55. Case 2: hive.auto.convert.join=true;
• Only 4 stages this time.

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-8 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-8
  Stage-0 is a root stage
56. Case 2: hive.auto.convert.join=true;
• Only 2 MapReduce jobs.

Stage: Stage-8  Map Reduce
Stage: Stage-4  Map Reduce
57. Case 2: hive.auto.convert.join=true;
• Stage-9 is Map Reduce Local Work. What is that?
• Hive is loading the dimension tables (client and path) directly. Client is filtered by country.
• The remaining map/reduce stages are for the join and order by.

client
  TableScan
    alias: client
    Filter Operator
      predicate: expr: (country = 'us')
...
path
  TableScan
    alias: path
...
58. Hive Fast Query Checklist
• Partitioned data along natural query boundaries (e.g. date).
• Minimized data shuffle by co-locating the most commonly joined data.
• Took advantage of skews for high-frequency values.
• Enabled short-circuit read.
• Used ORCFile.
• Sorted columns to facilitate row skipping for common targeted queries.
• Verified the query plan to ensure a single scan through the largest table.
• Checked the query plan to ensure partition pruning is happening.
• Used at least one ON clause in every JOIN.
59. For Even More Hive Performance
• Increased replication factor for frequently accessed data and dimensions.
• Tuned io.sort.mb to avoid spilling.
• Tuned mapred.max.split.size and mapred.min.split.size to ensure 1 mapper wave.
• Tuned mapred.reduce.tasks to an appropriate value based on map output.
• Checked the jobtracker to ensure “row container” spilling does not occur.
• Gave extra memory for mapjoins like broadcast joins.
• Disabled orc.compress (file size will increase) and tuned orc.row.index.stride.
• Ensured the job ran in a single wave of mappers.
60. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
61. Loading Data in Hive
• Sqoop
– Data transfer from external RDBMS to Hive.
– Sqoop can load data directly to/from HCatalog.
• Hive LOAD
– Load files from HDFS or local filesystem.
– Format must agree with table format.
• Insert from query
– CREATE TABLE AS SELECT or INSERT INTO.
• WebHDFS + WebHCat
– Load data via REST APIs.
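The LOAD and insert-from-query paths can be sketched as follows (table and path names are hypothetical):

```sql
-- File-based load: the file format must already match the table format.
LOAD DATA INPATH '/staging/page_views' INTO TABLE page_views;

-- Insert from query (CREATE TABLE AS SELECT).
CREATE TABLE us_views AS
SELECT * FROM page_views WHERE country = 'us';
```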
62. Optimized Sqoop Connectors
• Current:
– Teradata
– Oracle
– Netezza
– SQL Server
– MySQL
– Postgres
• Future:
– Vertica
63. High Performance Teradata Sqoop Connector
• High performance Sqoop driver.
• Fully parallel data load using
Enhanced FastLoad.
• Multiple readers and writers for
efficient, wire-speed transfer.
• Copy data between Teradata and
HDP Hive, HBase or HDFS.
Enhanced FastLoad developed in partnership with Teradata.
Free to download & use with Hortonworks Data Platform.
Multiple active data channels for fully
parallel data movement (N to M)
64. ACID Properties
• Data is loaded into Hive a partition or a table at a time.
– No INSERT or UPDATE statements. No transactions.
• Atomicity:
– Partition loads are atomic through directory renames in HDFS.
• Consistency:
– Ensured by HDFS. All nodes see the same partitions at all times.
– Immutable data = no update or delete consistency issues.
• Isolation:
– Read committed with an exception for partition deletes.
– Partitions can be deleted during queries. New partitions will not be
seen by jobs started before the partition add.
• Durability:
– Data is durable in HDFS before partition exposed to Hive.
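The partition-at-a-time load pattern can be sketched as follows (names are hypothetical); the partition becomes visible to Hive atomically once added:

```sql
-- Write the partition's data, then expose it in one atomic step.
INSERT OVERWRITE TABLE page_views PARTITION (view_date='2013-05-01')
SELECT user_id, url FROM staging_views;

-- Or, for files written externally to HDFS:
ALTER TABLE page_views ADD PARTITION (view_date='2013-05-01')
LOCATION '/data/page_views/2013-05-01';
```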
65. Handling Semi-Structured Data
• Hive supports arrays, maps, structs and unions.
• SerDes map JSON, XML and other formats natively
into Hive.
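A sketch of the complex types plus a JSON SerDe (the exact SerDe class varies by distribution; the one below is illustrative):

```sql
CREATE TABLE events (
  user  STRUCT<id:BIGINT, name:STRING>,
  tags  ARRAY<STRING>,
  props MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

-- Complex types are addressed directly in queries:
SELECT user.name, tags[0], props['os'] FROM events;
```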
66. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
67. Hive Authorization
• Hive provides users, groups, roles and privileges.
• Granular permissions on tables, DDL and DML
operations.
• Not designed for high security:
– On a non-Kerberized cluster, it is up to the client to supply its own user
name.
– Suitable for preventing accidental data loss.
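The grant model can be sketched as follows (role, table and user names are hypothetical):

```sql
CREATE ROLE analysts;
GRANT SELECT ON TABLE page_views TO ROLE analysts;
GRANT ROLE analysts TO USER alice;
```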
68. HiveServer2
• HiveServer2 is a gateway / JDBC / ODBC endpoint
Hive clients can talk to.
• Supports secure and non-secure clusters.
• DoAs support allows Hive queries to run as the
requesting user.
• (Coming Soon) LDAP authentication.
69. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
70. Roadmap to 100x Faster
Hadoop / Hive roadmap components:
• ORCFile: column store; high compression; predicate / filter pushdowns.
• Buffer Caching: cache accessed data; optimized for vector engine.
• Tez: express data processing tasks more simply; eliminate disk writes.
• Tez Service: pre-warmed containers; low-latency dispatch.
• Vector Query Engine: optimized for modern processor architectures.
• Base Optimizations: generate simplified DAGs; join improvements.
• Query Planner: intelligent cost-based optimizer.
Delivered across Phase 1, Phase 2 and Phase 3.
71. Phase 1 Improvements
Path to Making Hive 100x Faster
72. Join Optimizations
• Performance Improvements in Hive 0.11:
• New Join Types added or improved in Hive 0.11:
– In-memory Hash Join: Fast for fact-to-dimension joins.
– Sort-Merge-Bucket Join: Scalable for large-table to large-table
joins.
• More Efficient Query Plan Generation
– Joins done in-memory when possible, saving map-reduce steps.
– Combine map/reduce jobs when GROUP BY and ORDER BY use
the same key.
• More Than 30x Performance Improvement for Star
Schema Join
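The join strategies above are controlled by a few session settings; an illustrative sketch:

```sql
SET hive.auto.convert.join=true;       -- let Hive pick in-memory hash joins
SET hive.optimize.bucketmapjoin=true;  -- enable bucket map joins
SET hive.optimize.bucketmapjoin.sortedmerge=true;  -- upgrade to sort-merge-bucket
```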
73. Star Schema Join Improvements in 0.11
74. Hive 0.11 Star Schema Join Performance
In-Memory Hash Join
75. Hive 0.11 SMB Join Improvements
Sort-Merge-Bucket Join
76. ORCFile – Efficient Columnar Layout
Large block size well suited for HDFS.
Columnar format arranges columns adjacent within the file for compression and fast access.
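Declaring an ORC table, including the orc.compress and orc.row.index.stride properties mentioned elsewhere in the deck (values are illustrative):

```sql
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.row.index.stride'='10000');
```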
77. ORCFile Provides High Compression
78. Phase 2 & 3 Improvements
Path to Making Hive 100x Faster
79. Vectorization
• Designed for Modern Processor Architectures
– Make the most use of L1 and L2 cache.
– Avoid branching whenever possible.
• How It Works
– Process records in batches of 1,000 to maximize cache use.
– Generate code on-the-fly to minimize branching.
• What It Gives
– 30x+ improvement in number of rows processed per second.
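Vectorized execution is a per-session switch; a sketch (the flag landed in Hive releases later than the 0.11 discussed above):

```sql
-- Enable vectorized execution for this session (Hive 0.13+).
SET hive.vectorized.execution.enabled=true;
```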
80. Other Runtime Optimizations
• Optimized Query Planner
– Automatically determine optimal execution parameters.
• Buffering
– Cache hot data in memory.
81. Hadoop 2.0 - YARN
• Re-architected Hadoop framework
• Focus on scale and innovation
– Support 10,000+ computer clusters
– Extensible to encourage innovation
• Next generation execution
– Improves MapReduce performance
• Supports new frameworks beyond
MapReduce
– Low latency, streaming, etc.
– Do more with a single Hadoop cluster
[Diagram: HDFS (redundant, reliable storage) beneath YARN (cluster resource management), which runs MapReduce, Tez, graph processing and other frameworks]
• Separation of resource management from MapReduce
82. Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
– A new incubation project in the ASF
• Who else is involved?
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop
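Once Hive gains Tez support, switching engines is a one-line session setting; an illustrative sketch:

```sql
-- Run subsequent queries on Tez instead of MapReduce (Hive 0.13+).
SET hive.execution.engine=tez;
```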
84. Tez - Core Idea
YARN ApplicationMaster to run DAG of Tez Tasks.
Task with pluggable Input, Processor and Output:
Tez Task = <Input, Processor, Output>
85. Tez: Building blocks for scalable data processing
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
86. Hive/MR versus Hive/Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: the same query as two execution plans. Hive/MR runs it as a chain of separate map/reduce jobs (SELECT b.id; JOIN(a, b); SELECT a.state, c.itemId; JOIN(a, c); GROUP BY a.state with COUNT(*) and AVG(c.price)), writing intermediate results to HDFS between jobs. Hive/Tez runs the same stages as a single DAG, passing data directly between tasks. Tez avoids unneeded writes to HDFS.]
87. Tez Service
• Map/Reduce Query Startup Is Expensive
• Solution
– Tez Service
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
– Hive
– Submits query plan directly to Tez Service
– Native Hadoop service, not ad-hoc
88. Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
89. Hive: The De-Facto SQL Interface for Hadoop
90. Connectivity
• JDBC – Included with Hive.
• ODBC – Free driver available at hortonworks.com.
• WebHCat – Run jobs using a simple REST interface.
• BI Ecosystem – Most popular BI tools support Hive.
91. Learn More with Hortonworks HDP Sandbox
Built-in tutorials show how to connect
to Hive with ODBC and Excel.
Learn to connect to Hadoop with
popular BI tools.