Apache Hive is an open source data warehousing framework built on Hadoop. It allows users to query large datasets using SQL and handles parallelization behind the scenes. Hive supports various file formats like ORC, Parquet, and Avro. It uses a directed acyclic graph (DAG) execution engine like Tez or Spark to improve performance over traditional MapReduce. The metastore stores metadata about databases, tables, and partitions to allow data discovery and abstraction. Hive's cost-based optimizer and in-memory query processing features like LLAP improve performance for interactive queries on large datasets.
4. About me
Oracle ACE
Data and Linux geek
Long-time open source supporter
Works for @redgluept as Data Architect
@drune
5. Big Data Thinking Strategy
●Think small
●Think big
●Don’t think at all (hype is here)
6. What is Apache Hive?
●Open source, TB/PB-scale data warehousing framework based on Hadoop
●The first and most complete SQL-on-"Hadoop"
●SQL:2003 and SQL:2011 compatible
●Data stored in several formats
●Several execution engines available
●Interactive query support (in-memory cache)
7. Apache Hive - Before you ask
●Data warehouse/OLAP activities (data mining, data
exploration, batch processing, ETL, etc.) - "the
heavy lifting of data"
●Low-cost scaling, built with extensibility in mind
●Use for large datasets (gigabyte/terabyte scale)
●Don't use Hive for any OLTP activities
●ACID exists, but is not recommended yet
8. The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR
job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate
map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how
hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so
that the average user would never have to go through this. Fortunately - our team had
Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
9. The reason behind Hive
Instead of complex MR jobs
You have declarative language...
10. Apache Hive versions & branches
master branch-1
Version 2.x
New code
New features
Version 1.x
Stable
Backwards
compatibility
Critical
bugs
Hadoop 1.x and 2.x
supported
Hadoop 2.x
supported
stable features
11. Data Model (data units & types)
●Supports primitive column types (integers,
numbers, strings, date/time and booleans)
●Supports complex types: Structs, Maps and
Arrays
●Concept of databases, tables, partitions and
buckets
●SerDe: the serializer/deserializer API used to
move data in and out of tables
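The SerDe layer can be sketched with a hypothetical table whose files are JSON documents, using the built-in JsonSerDe (table and column names are illustrative):

```sql
-- The SerDe translates between the file bytes and Hive's row objects.
CREATE TABLE events_json (
  event_id   BIGINT,
  event_type STRING,
  payload    MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```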
12. Data Model (partitions & bucketing)
● Partitioning: used for distributing load horizontally; gives a performance benefit and
organizes data
PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ
● Buckets (clusters): decompose data sets into more manageable parts, help
with map-side joins, and enable correct sampling within the same bucket
"Records with the same flightID will always be stored in the same bucket.
Assuming the number of flightIDs is much greater than the number of buckets, each
bucket will have many flightIDs."
CLUSTERED BY (flightID) INTO XX BUCKETS;
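Putting the two fragments above together, a minimal sketch of a table that is both partitioned and bucketed might look like this (the table, its columns, and the bucket count are illustrative):

```sql
CREATE TABLE flights (
  flightID INT,
  origin   STRING,
  dest     STRING
)
-- Each distinct (flightName, AircraftName) pair becomes an HDFS directory.
PARTITIONED BY (flightName STRING, AircraftName STRING)
-- Within a partition, rows are hashed on flightID into a fixed set of files.
CLUSTERED BY (flightID) INTO 32 BUCKETS
STORED AS ORC;
```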
13. Data Model (complex data types)
Array: ordered collection of fields, all of the same type - array(1, 2)
Map: unordered key-value pairs; keys are primitives, values are any type - map('a', 1, 'b', 2)
Struct: a collection of named fields - struct('a', 10, 2.5)
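A short sketch of how these three types are declared and accessed (hypothetical table and column names):

```sql
CREATE TABLE crew (
  name      STRING,
  languages ARRAY<STRING>,
  scores    MAP<STRING, INT>,
  address   STRUCT<city:STRING, country:STRING>
);

-- Arrays and maps are indexed with [], struct fields with dot notation:
SELECT languages[0], scores['safety'], address.city FROM crew;
```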
15. HiveQL
●HiveQL is an SQL-like query language for Hive
●Supports DDL and DML
●Supports multi-table inserts
●Possible to write custom map-reduce scripts
●Supports UDFs, UDAFs and UDTFs
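A multi-table insert, one of the features listed above, lets a single scan of a source table feed several destinations. A minimal sketch, with hypothetical table names:

```sql
-- One pass over flights populates two tables with different filters.
FROM flights f
INSERT OVERWRITE TABLE short_haul SELECT f.* WHERE f.distance <  1000
INSERT OVERWRITE TABLE long_haul  SELECT f.* WHERE f.distance >= 1000;
```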
16. DDL (some examples)
HIVE> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> TRUNCATE TABLE
HIVE> ALTER DATABASE/SCHEMA, TABLE, VIEW
HIVE> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS,
PARTITIONS, FUNCTIONS
HIVE> DESCRIBE DATABASE/SCHEMA, table_name, view_name
17. File formats
● Parquet: compressed, efficient columnar data
representation available to any project in the Hadoop
ecosystem
● ORC: made for Hive; supports the Hive type model, columnar
storage, block compression, predicate pushdown, ACID*,
etc.
● Avro: uses JSON for defining data types and protocols, and
serializes data in a compact binary format
● Compressed file formats (LZO, GZIP)
● Plain text files
● Any other format can be read given a suitable SerDe
(CSV, JSON, XML, etc.)
18. ORC
●Stored as columns and compressed = smaller disk
reads
●ORC has a built-in index, min/max values, and
other aggregates (e.g. sum, max) = skip entire
blocks to speed up reads
●ORC implements predicate pushdown and bloom
filters
●ORC scales
●You should use it :-)
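As a sketch, an ORC table with block compression and a bloom filter on a frequently filtered column can be declared via table properties (table and column names are illustrative; the property keys are the standard ORC ones):

```sql
CREATE TABLE flightperf_orc (
  flightnum INT,
  origin    STRING,
  dest      STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',               -- block compression
  'orc.bloom.filter.columns' = 'flightnum'  -- speeds up point lookups
);
```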
19. Indexing
● Not recommended, because of ORC
● ORC has built-in indexes which allow the format to skip
blocks of data during reads
● Hive indexes are implemented as tables
● Compact indexes and bitmap indexes are supported
● They are tables recording which data is in
which blocks, used to skip data (as ORC already
does)
● Not supported on the Tez engine - ignored
● Indexes in Hive are not like indexes in other databases
21. Hive Architecture
Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC
Thrift Server (HiveServer2)
Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan Generator), Optimizer, Executor, Metastore client
Metastore: RDBMS
Execution Engines: MapReduce, Tez, Spark
Resource Management: YARN
Storage: HDFS, HBase, Azure Storage, Amazon S3
22. Metastore
● Typically stored in an RDBMS (MySQL, SQL Server,
PostgreSQL, Derby*) - gives ACID and concurrency on metadata
queries
● Contains metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: information about data formats,
extractors and loaders is provided at table creation and reused
(cf. Oracle's dictionary tables)
● Data discovery: discover relevant and specific data; allows
other tools to use the metadata to explore data (e.g. SparkSQL)
24. Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark
MR: the original, most stable and most reliable; batch-oriented,
disk-based parallelism (like traditional Hadoop MR jobs).
Tez: high-performance batch and interactive data processing. Stable
99% of the time. The one you should use. Default on HDP.
Spark: uses Apache Spark (an in-memory computing platform); high-performance
(like Tez), not used in production (yet), good progress.
25. MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce does one level of aggregation over the
data. Complex computations typically require multiple such steps.
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph has no cycles because the fault-tolerance
mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop became a key reason to
introduce DAGs
● Pipelines consecutive map steps into one
● Removes the enforced serialization between consecutive MapReduce
jobs, allowing concurrency
26. Tez & DAGs
DAG definition:
● Data processing is expressed in the form of a directed acyclic graph
(DAG)
Two main components:
● Vertices - nodes in the graph representing processing of data
○ User logic that analyses and modifies the data sits in the vertices
● Edges - represent movement of data between the processing steps
○ Define routing of data between tasks (One-To-One, Broadcast,
Scatter-Gather)
○ Define when a consumer task is scheduled (Sequential,
Concurrent)
○ Define the lifetime/reliability of a task output
27. Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from conventional
relational query engines in its handling of intermediate
result sets
● Query processing requires sorting and reassembling of intermediate
result sets - shuffling
● Most of the existing optimizations in Hive are about minimizing
shuffling cost, plus logical optimizations like filter push-down,
projection pruning and partition pruning
● Join reordering and join-algorithm selection become possible with a cost-based optimizer
28. Hive CBO - What you get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● With a cost-based optimizer you get:
○ Join ordering (join reordering)
○ Which algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on
failure
○ The degree of parallelism at any operator (number of mappers and
reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
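In practice the CBO needs to be enabled and fed statistics before it can reorder joins. A minimal sketch using the standard configuration properties (the `flights` table name is illustrative):

```sql
-- Enable the CBO and statistics-based planning.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- Gather the table- and column-level statistics the optimizer relies on.
ANALYZE TABLE flights COMPUTE STATISTICS;
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```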
30. Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process) - Hive interactive queries
● ACID
31. Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk-oriented to memory-oriented execution (the trend)
● Disks are connected to the CPU via the network - data locality is no longer relevant
SQL:2011 - Seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language
2007 - 15TB
2009 - 2PB
https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-UnderstandingHiveBranches
Release and feature branches not added to the slide, as they would make it too complex
Predicate Pushdown: running operations that filter or cut down data as close to the beginning of your map-reduce pipeline as possible
Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
Show create tables different formats (ORC and PLAINTEXT)
Create an index on a table:
Not supported in TEZ
set hive.execution.engine=mr;
create index idxFlightNum on table flightperfall(flightnum) AS 'COMPACT' WITH DEFERRED REBUILD;
alter index idxFlightNum ON flightperfall rebuild;
show formatted index on flightperfall;
explain select * from flightperfall where flightnum=613 limit 1;
set hive.optimize.index.filter.compact.minsize=10;
explain select * from flightperfall where flightnum=613 limit 1;
set hive.optimize.index.filter.compact.minsize=5368709120;
Execution times;
Show operator tree with index and without index
ORC vs CSV query time:
select * from flightperfall_orc where flightnum=613 limit 1;
Describe - components
HiveCLI - management tools
Ambari - Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters.
HiveServer2 - HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is based on Apache Thrift RPC. It is an improved version of HiveServer and supports multi-client concurrency, authentication, and better support for open API clients like JDBC and ODBC.
Driver - Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution
Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore:
Parser – Transform a query string to a parse tree representation
Semantic Analyser - Transform the parse tree to an internal query representation (column names are verified and expansions like * are performed), Type-checking and any implicit type conversions and partition checking.
Logical Plan Generator - Convert the internal query representation to a logical plan, which consists of a tree of operators. This step also includes the optimizer to transform the plan to improve performance;
Query Plan Generator – Convert the logical plan to a series of map-reduce tasks (or DAGs stages)
Optimizer - As of 2011, it was rule-based and performed column pruning and predicate pushdown. Now it is cost-based, like an RDBMS.
Executor engine (Processing) - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages
Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
https://cwiki.apache.org/confluence/display/Hive/Design
ssh root@127.0.0.1 -p 2222 (sandbox)
Test CLI (beeline and hive cmd)
Beeline: !connect jdbc:hive2://localhost:10000
Show ambari
- Identify metastore hive (mysql database)
- mysql -u root -p ; password: hadoop ; show databases; use hive; select * from DBS; select * from TBLS;
Identify execution engines:
SET hive.execution.engine
Identify CBO active:
set hive.cbo.enable;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;
explain select * from sample_07, sample_08 where sample_07.code = sample_08.code and sample_07.salary > 1000;
Conditions for the CBO, for example: statistics on the table, columns or others (too few joins).
Show a database, a table and a file stored in HDFS
Hdfs
Tez – Hindi for “speed”
Example: jobs A and B are independent of each other, but job C needs the results from A and B to complete, Tez will execute A and B in any order and forward the results to C
One-To-One: Data from the ith producer task routes to the ith consumer task.
Broadcast: Data from a producer task routes to all consumer tasks.
Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards
Sequential: Consumer task may be scheduled after a producer task completes.
Concurrent: Consumer task must be co-scheduled with a producer task.
In Hive most of the optimizations are not based on the cost of query execution. Most of the optimizations do not rearrange the operator tree except for filter push down and operator merging.
http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
Query:
SELECT year, month, origin, dest, distance FROM flights.flightperfall_orc where flightnum in (select max(flightnum) from flights.flightperfpartorc where year=2008)
MR: (41.1 seconds)
Tez: (3.761 seconds)
Show Tez View (via ambari)
analyze table customer COMPUTE STATISTICS;
analyze table customer COMPUTE STATISTICS for columns;
use foodmart;
explain select * from sales_fact_dec_1998 sf, customer c, product p, store ss
where sf.customer_id = c.customer_id
and p.product_id = sf.product_id
and ss.store_id = sf.store_id
and sf.customer_id > 100
and ss.store_id = 5