Hive
What’s new and what’s next?


Gunther Hagleitner
Hortonworks
@yakrobat




                              Page 1
STINGER Initiative




               SPEEED!! POWERRR!! – Jeremy Clarkson

                                              Page 2
ROLLUP, CUBE (Hive 0.10)
select state, year, sum(amt_paid)   select state, year, sum(amt_paid)
from sales                          from sales
group by state, year with rollup    group by state, year with cube

State      Year       Sum           State      Year       Sum
CA         2011       20000         CA         2011       20000
CA         2012       25000         CA         2012       25000
CA         *          45000         CA         *          45000
NY         2012       15000         NY         2012       15000
NY         *          15000         NY         *          15000
*          *          60000         *          *          60000
                                    *          2011       20000
                                    *          2012       40000


                                                                  HIVE-3433
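The two result tables above can be reproduced with a small sketch. This is only an illustration of the grouping-set semantics, not how Hive executes the query: ROLLUP aggregates over every prefix of the group-by columns, CUBE over every subset. The function names are made up for this example, and "*" is used as the rolled-up marker to match the slide.

```python
from collections import defaultdict
from itertools import combinations

# Rows from the example: (state, year, amt_paid).
SALES = [("CA", 2011, 20000), ("CA", 2012, 25000), ("NY", 2012, 15000)]

def aggregate(rows, grouping_sets):
    """Sum amt_paid once per grouping set; '*' marks a rolled-up column."""
    totals = defaultdict(int)
    for state, year, amt in rows:
        for keep in grouping_sets:
            key = (state if 0 in keep else "*", year if 1 in keep else "*")
            totals[key] += amt
    return dict(totals)

def rollup_sets(n):
    # Prefixes of the column list: (0, 1), (0,), ()
    return [tuple(range(i)) for i in range(n, -1, -1)]

def cube_sets(n):
    # Every subset of the column list: (0, 1), (0,), (1,), ()
    return [c for r in range(n, -1, -1) for c in combinations(range(n), r)]

rollup = aggregate(SALES, rollup_sets(2))
cube = aggregate(SALES, cube_sets(2))
```

With the three input rows, rollup has 6 entries and cube has 8; the two extra cube rows are ("*", 2011) and ("*", 2012).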
                                                                        Page 3
Support for Analytics
• Simple analytical tasks can turn into unintuitive and inefficient queries

select
 count(*) as rk,
 s2.state as state,
 s2.product as product,
 avg(s2.amt_paid),
 sum(s1.amt_paid)
from
 sales s1
   join sales s2
   on (s1.product = s2.product and s1.state = s2.state)
where s1.year <= s2.year
group by s2.state, s2.product, s2.year
order by state, product, rk;



                                                                          Page 4
Support for Analytics
• Simple numbering + running total


   Number    State      Product      Amount   Total

   1         CA         A            1000     1000

   2         CA         A            500      1500

   3         CA         A            700      2200

   4         CA         A            300      2500

   1         CA         B            500      500

   2         CA         B            500      1000
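As a rough sketch of what the table above asks for, the numbering and running total can be computed in one pass that keeps a counter and a sum per (state, product) group. The function name is hypothetical, and the rows are assumed to arrive already in the desired order.

```python
from collections import defaultdict

# (state, product, amount) rows in the order shown in the table.
ROWS = [
    ("CA", "A", 1000), ("CA", "A", 500), ("CA", "A", 700), ("CA", "A", 300),
    ("CA", "B", 500), ("CA", "B", 500),
]

def number_and_running_total(rows):
    """Attach a per-group row number and running total to each row."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    out = []
    for state, product, amount in rows:
        key = (state, product)      # the "partition" the row belongs to
        counts[key] += 1
        totals[key] += amount
        out.append((counts[key], state, product, amount, totals[key]))
    return out

result = number_and_running_total(ROWS)
```

The Number and Total columns of result match the table: 1 to 4 with totals 1000, 1500, 2200, 2500 for product A, then 1 to 2 with totals 500, 1000 for product B.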




                                                      Page 5
Support for Analytics
• Faster, but still not very intuitive

select
   state,
   product,
   amt_paid,
   rsum(hash(state, product),amt_paid)
from
(
  select state, product, amt_paid
  from sales distribute by hash(state,product)
  sort by state, product
) t;




                                                 Page 6
Support for Analytics – OVER clause
• Now that’s more like it

select
  rank() over state_and_product,
  state,
  product,
  amt_paid,
  sum(amt_paid) over state_and_product
from sales
window state_and_product
  as (partition by state, product order by year);




                                                    Page 7
Support for Analytics – OVER clause
      partition by          order by                                      rows


         AL                  2012               1000.00
         CA                  2010               2000.00
         CA                  2011               2000.00
         CA                  2012               4000.00
         CA                  2013               1000.00
         NY                  2012               500.00

• OVER clause
   – PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING
   – Works with current aggregate functions
   – New aggregates/window functions
        – RANK, ROW_NUMBER, LAG, LEAD, FIRST_VALUE, LAST_VALUE
        – NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT,
          PERCENTILE_DISC
                                                                       HIVE-896
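To make the three parts of the OVER clause concrete, here is a minimal sketch of a windowed sum. It treats the three columns of the table above as state, year and amount (an assumption; the slide labels them only by their role), and evaluates the equivalent of sum(amount) over (partition by state order by year rows between 1 preceding and 1 following). The function name is made up for this example.

```python
from itertools import groupby

# (state, year, amount) rows from the example table.
ROWS = [
    ("AL", 2012, 1000.0),
    ("CA", 2010, 2000.0), ("CA", 2011, 2000.0),
    ("CA", 2012, 4000.0), ("CA", 2013, 1000.0),
    ("NY", 2012, 500.0),
]

def windowed_sum(rows, preceding, following):
    """Sum amount over a ROWS frame, per state, ordered by year."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))    # partition by, order by
    for _, part in groupby(ordered, key=lambda r: r[0]):  # one partition per state
        part = list(part)
        for i, (state, year, _) in enumerate(part):
            lo = max(0, i - preceding)                # frame start, clipped to partition
            hi = min(len(part), i + following + 1)    # frame end (exclusive)
            out.append((state, year, sum(r[2] for r in part[lo:hi])))
    return out

result = windowed_sum(ROWS, preceding=1, following=1)
```

The frame never crosses a partition boundary, so the single-row AL and NY partitions keep their own amounts, while the CA rows come out as 4000, 8000, 7000 and 5000.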
                                                                           Page 8
Support for Analytics Continued

• Sub-queries in WHERE
   – Non-correlated only
   – [NOT] IN supported
   – Plan to optimize to fit in memory as hash table when feasible, join when not
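A sketch of the hash-table idea for a non-correlated [NOT] IN sub-query: because the sub-query does not depend on the outer row, its result can be computed once and probed per row. The data and the function name are invented for illustration, and SQL NULL handling for NOT IN is deliberately left out.

```python
def filter_in(rows, column, subquery_values, negate=False):
    """Keep rows whose column value is (or, with negate, is not) in the sub-query result."""
    lookup = set(subquery_values)   # built once: the sub-query is non-correlated
    return [r for r in rows if (r[column] in lookup) != negate]

# Hypothetical outer table and sub-query result.
sales = [
    {"state": "CA", "amt": 10},
    {"state": "NY", "amt": 20},
    {"state": "AL", "amt": 5},
]
west = ["CA", "OR", "WA"]

kept = filter_in(sales, "state", west)                  # state IN (...)
dropped = filter_in(sales, "state", west, negate=True)  # state NOT IN (...)
```

When the sub-query result is too large to hold in memory, the same filter would have to be expressed as a join instead, which is the fallback the slide mentions.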


• Standard SQL data types
   –   datetime
   –   char() and varchar()
   –   add precision and scale to decimal and float
   –   aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int,
       real/number = decimal)




                                                                                      Page 9
Automatic join conversion



          [Diagram: join selection flowchart. Recoverable labels: "Sorted?" leading to "Sort Merge Bucket Join"]


                      • When enabled, Hive automatically picks the
                        join implementation
                      • Query hints are no longer needed
                      • Can be configured to run without
                        conditional tasks

                                                     HIVE-3784
                                                          Page 10
Merging join tasks
select
   …
from sales
   join date_dim on (…)
   join time_dim on (…)
   join state on (…)
   …

Before: Task (Mapjoin)   Task (Mapjoin)   Task (Mapjoin)
After:  Task (Mapjoin -> Mapjoin -> Mapjoin)



• Hive used to generate a sequence of map-only jobs
• Hive now does as many map-joins as fit in memory in a single map-only job
• The memory limit is configurable
• Memory size is estimated from file size



                                                                      HIVE-3784
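The effect of chaining map-joins inside one task can be sketched as follows: each small table is loaded into an in-memory hash table, and every fact row is probed against all of them in a single pass. This is a simplified illustration, not Hive's operator code; the function names and sample rows are made up, dimension keys are assumed unique, and the planner's size check is not modeled.

```python
def build_hash_table(rows, key):
    """Load a small dimension table into memory, keyed by its join column."""
    return {row[key]: row for row in rows}

def map_join_chain(fact_rows, joins):
    """Inner-join each fact row against every (fact_key, hash_table) pair in one pass."""
    for fact in fact_rows:
        joined = dict(fact)
        for fact_key, table in joins:
            match = table.get(fact[fact_key])
            if match is None:
                break               # inner join: drop the row on the first miss
            joined.update(match)
        else:
            yield joined            # only reached when every probe matched

# Hypothetical data.
sales = [
    {"date_id": 1, "state_id": 10, "amt": 100},
    {"date_id": 2, "state_id": 99, "amt": 50},   # no matching state row
]
date_dim = [{"date_id": 1, "year": 2012}, {"date_id": 2, "year": 2013}]
state = [{"state_id": 10, "state": "CA"}]

result = list(map_join_chain(sales, [
    ("date_id", build_hash_table(date_dim, "date_id")),
    ("state_id", build_hash_table(state, "state_id")),
]))
```

The fact table is scanned once no matter how many dimension tables are joined, which is what removes the intermediate map-only jobs.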
                                                                              Page 11
M-MR to MR
select
   sum(…)
   …
from sales
   join date_dim on (…)
group by …
   …

Before: Map Task (Mapjoin)   Map Task (Map-side Aggr)   Reduce Task (Group by/Aggr)
After:  Map Task (Mapjoin -> Map-side Aggr)   Reduce Task (Group by/Aggr)



• Used to run as a map-only job followed by a map-reduce job
• Hive now merges the two map tasks




                                                                       HIVE-3952
                                                                              Page 12
Group by/Order by (ReduceSinkDeDup)
select
   …
from sales
group by store, item
order by store

Before: Map Task (Map-side Aggr)   Reduce Task (Group by/Aggr)   Map Task (Noop)   Reduce Task (Noop)
After:  Map Task (Map-side Aggr)   Reduce Task (Group by/Aggr)

• Used to generate one map-reduce job for the group by, followed by a second
map-reduce job for the order by
• Hive now does both in the same job
• More generally: Hive searches for reduce sinks on the same keys and combines them
• Caution: might degrade performance if the difference in the number of reducers is large


                                                                        HIVE-2340
                                                                            Page 13
Upcoming: Limit pushdown
select
   …
from sales
   group by store, item
   order by store
limit 20

Before: Map Task (Map-side Aggr)           Reduce Task (Group by/Aggr + Limit)
After:  Map Task (Map-side Aggr + Top-k)   Reduce Task (Group by/Aggr + Limit)

• Used to output all pre-aggregated data from the map task and apply the limit
only in the reducer
• Hive will keep a top-k list of elements in each map task, reducing the amount
of data to be shuffled



                                                                           HIVE-3562
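A minimal sketch of the top-k idea, under a simplifying assumption: the sort key is the same as the group-by key (the slide's query orders by store only, which this sketch does not cover). Each map task pre-aggregates and forwards only its k smallest keys. Any key that belongs in the global top k is also in the local top k of every map task that saw it, so the reducer still has complete partial sums for the rows it returns. Function names and data are made up.

```python
import heapq
from collections import defaultdict

def map_task(rows, k):
    """Pre-aggregate by key, then forward only the k smallest keys."""
    partial = defaultdict(int)
    for key, amount in rows:
        partial[key] += amount
    return heapq.nsmallest(k, partial.items(), key=lambda kv: kv[0])

def reduce_task(map_outputs, k):
    """Merge partial sums, then apply the final order by / limit."""
    totals = defaultdict(int)
    for output in map_outputs:
        for key, amount in output:
            totals[key] += amount
    return sorted(totals.items())[:k]

# Two hypothetical input splits of (key, amount) rows.
split1 = [("a", 1), ("c", 2), ("a", 3), ("d", 4)]
split2 = [("b", 5), ("c", 6), ("e", 7)]
K = 2

shuffled = [map_task(split1, K), map_task(split2, K)]
result = reduce_task(shuffled, K)

# Same query without the pushdown: every pre-aggregated row is shuffled.
baseline = reduce_task([map_task(split1, 10), map_task(split2, 10)], K)
```

Here result equals baseline, [("a", 4), ("b", 5)], while only 4 records are shuffled instead of 6.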
                                                                              Page 14
Upcoming: Total order sort




• Order by queries no longer result in a single reducer
• Makes it easier to apply optimizations such as group by/order by
• Requires knowledge of the key distribution (sampling)




                                                                     HIVE-1402
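Why sampling is needed can be seen in a small sketch of range partitioning: a sample of the keys gives approximate quantiles, those become reducer boundaries, and each reducer sorts only its own range. Reading the reducer outputs in order then yields a total order. This is an illustration only; the fixed-stride sample stands in for real sampling, and all names are hypothetical.

```python
import bisect

def pick_split_points(sample, num_reducers):
    """Choose num_reducers - 1 boundaries at even quantiles of a key sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def total_order_sort(keys, num_reducers, sample_every=3):
    sample = keys[::sample_every]                 # stand-in for real sampling
    splits = pick_split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:                              # range-partition instead of hashing
        buckets[bisect.bisect_right(splits, key)].append(key)
    # Each "reducer" sorts only its own key range.
    return [sorted(bucket) for bucket in buckets]

keys = [42, 7, 19, 88, 3, 61, 25, 70, 14, 55, 96, 31]
reducer_outputs = total_order_sort(keys, num_reducers=3)
merged = [k for out in reducer_outputs for k in out]
```

A poor sample does not break correctness, only balance: skewed split points send most keys to one reducer, which is why knowledge of the key distribution matters.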
                                                                        Page 15
ORC – Optimized RCFile




                             HIVE-3874
                                Page 16
   © Hortonworks Inc. 2012
File Layout




                              Page 17
ORC-enabled improvements




                             Page 18
Beyond Batch with YARN & Tez




Tez Generalizes Map-Reduce: Simplified execution plans process data more efficiently
Always-On Tez Service: Low latency processing for all Hadoop data processing




                                                                  Page 19
Tez Service
• Hive Query Startup Expensive
  – Job-launch & task-launch latencies are fatal for short queries (on the
    order of 5s to 30s)
• Solution
  – Tez Service
      – Removes task-launch overhead
      – Removes job-launch overhead
  – Hive
      – Submit query-plan to Tez Service
  – Native Hadoop service, not ad-hoc




                                                                     Page 20
Tez - Core Idea
Task with pluggable Input, Processor & Output




      Tez Task: Input -> Processor -> Output




           YARN ApplicationMaster to run DAG of Tez Tasks
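The shape of this idea can be sketched in a few lines. The classes and functions below are hypothetical and are not the Tez API: a task is just three pluggable callables, and a trivial driver runs a list of tasks that is assumed to be already in dependency order.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Task:
    """Hypothetical task built from three pluggable pieces (not the Tez API)."""
    read: Callable[[], Iterable]               # Input
    process: Callable[[Iterable], Iterable]    # Processor
    write: Callable[[Iterable], None]          # Output

    def run(self) -> None:
        self.write(self.process(self.read()))

def run_dag(tasks_in_dependency_order: List[Task]) -> None:
    """Stand-in for the driver that runs the DAG of tasks."""
    for task in tasks_in_dependency_order:
        task.run()

# Two tasks connected through an in-memory buffer standing in for a data edge.
edge, sink = [], []
scan = Task(read=lambda: [3, 1, 2],
            process=lambda rows: (r * 10 for r in rows),
            write=edge.extend)
sort = Task(read=lambda: edge, process=sorted, write=sink.extend)
run_dag([scan, sort])
```

After the run, sink holds [10, 20, 30]. Swapping the in-memory buffer for a file or a network channel changes only the read and write pieces, which is the point of making them pluggable.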

                                                     Page 21
Hive/MR versus Hive/Tez

                                   select a.state, count(*)
                                   from a join b on (a.id = b.id)
                                   group by a.state




 Hive - MR:  I/O Synchronization Barrier
 Hive - Tez: I/O Pipelining


                                                                                                  Page 22
Hive/MR versus Hive/Tez
          select store, state, total from
            (select storeid, sum(sales_price) total
             from sales s join date_dim d on (s.dateid = d.dateid)
             where d.year = 2012
             group by storeid ) ss
          join
            (select storeid, store, state from store
             join state on (store.stateid = state.stateid) ) sd
          on (sd.storeid = ss.storeid)




    Hive - MR                                                        Hive - Tez


                                                                                  Page 23
Hive Performance Longer Term - Caching
•   Need to be able to keep hot data sets in memory
•   Could be done via pinning files in OS buffer cache
•   Could be done with separate process running its own buffer cache
•   Need to evaluate best plan
•   Would like to pin dimension tables in memory
•   Latest partitions of large tables also a candidate
•   Ideally will include changes to the scheduler to understand which nodes have
    which partitions/tables cached




                                                                             Page 24
Hive Performance Longer Term - Vectorization
• Rewrite operators to work on arrays of Java scalars
• MonetDB paper
• Operates on blocks of 1K or more records
• Each block contains an array of Java scalars, one for each column
• Avoids many function calls
• Size to fit in L1 cache, avoid cache misses
• Generate code for operators on the fly to avoid branches in code and make the
  most of the deep pipelines of modern processors
• Requires conversion of all column values to Java scalars – no objects allowed
    – Integrates nicely with ORC work, other input types will need conversion on reading
• Want to write this in a way it can be shared by Pig, Cascading, MR
  programmers
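The data layout behind these bullets can be sketched as follows. Python will not show the performance benefit, so this only illustrates the structure: each column is a primitive array, rows are processed in batches of 1024, and operators work on a whole column of a batch at a time instead of being called once per row. All names and data are made up for this example.

```python
from array import array

BATCH_SIZE = 1024  # the slides mention blocks of 1K or more records

def batches(columns, size=BATCH_SIZE):
    """Yield batches holding one primitive array slice per column."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, size):
        yield {name: col[start:start + size] for name, col in columns.items()}

def filter_gt(batch, column, threshold):
    """Filter operator: return the selected row indexes for one batch."""
    return [i for i, v in enumerate(batch[column]) if v > threshold]

def sum_selected(batch, column, selected):
    """Aggregate operator: sum one column over the selected rows."""
    values = batch[column]
    return sum(values[i] for i in selected)

# Hypothetical table with 3000 rows stored column by column.
columns = {
    "year": array("i", [2011, 2012, 2012] * 1000),
    "amt_paid": array("d", [20000.0, 25000.0, 15000.0] * 1000),
}

# select sum(amt_paid) from sales where year > 2011
total = 0.0
for batch in batches(columns):
    selected = filter_gt(batch, "year", 2011)
    total += sum_selected(batch, "amt_paid", selected)
```

Each operator is invoked once per batch (3 times for 3000 rows) rather than once per row, and the filter hands a list of selected indexes to the next operator instead of copying rows.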




                                                                                    Page 25
Page 26

More Related Content

Viewers also liked

Sub-queries,Groupby and having in SQL
Sub-queries,Groupby and having in SQLSub-queries,Groupby and having in SQL
Sub-queries,Groupby and having in SQLVikash Sharma
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Procedures and triggers in SQL
Procedures and triggers in SQLProcedures and triggers in SQL
Procedures and triggers in SQLVikash Sharma
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commandsbispsolutions
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache HiveMurtaza Doctor
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXBMC Software
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 

Viewers also liked (10)

Sub-queries,Groupby and having in SQL
Sub-queries,Groupby and having in SQLSub-queries,Groupby and having in SQL
Sub-queries,Groupby and having in SQL
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Procedures and triggers in SQL
Procedures and triggers in SQLProcedures and triggers in SQL
Procedures and triggers in SQL
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hadoop basic commands
Hadoop basic commandsHadoop basic commands
Hadoop basic commands
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 

Similar to What's new in Apache Hive

Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Risk managementusinghadoop
Risk managementusinghadoopRisk managementusinghadoop
Risk managementusinghadoopsapientindia
 
Biug 20112026 dimensional modeling and mdx best practices
Biug 20112026   dimensional modeling and mdx best practicesBiug 20112026   dimensional modeling and mdx best practices
Biug 20112026 dimensional modeling and mdx best practicesItay Braun
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Nathan Bijnens
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Recommendations play @flipkart (3)
Recommendations play @flipkart (3)Recommendations play @flipkart (3)
Recommendations play @flipkart (3)hava101
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesYang Li
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezJ On The Beach
 
Z Garbage Collector
Z Garbage CollectorZ Garbage Collector
Z Garbage CollectorDavid Buck
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api TrainingSpark Summit
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache SparkNaukri.com
 

Similar to What's new in Apache Hive (20)

Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Risk managementusinghadoop
Risk managementusinghadoopRisk managementusinghadoop
Risk managementusinghadoop
 
Biug 20112026 dimensional modeling and mdx best practices
Biug 20112026   dimensional modeling and mdx best practicesBiug 20112026   dimensional modeling and mdx best practices
Biug 20112026 dimensional modeling and mdx best practices
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
 
Yoda fifth elephant
Yoda fifth elephantYoda fifth elephant
Yoda fifth elephant
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Recommendations play @flipkart (3)
Recommendations play @flipkart (3)Recommendations play @flipkart (3)
Recommendations play @flipkart (3)
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Spark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and OpsSpark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and Ops
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
 
Z Garbage Collector
Z Garbage CollectorZ Garbage Collector
Z Garbage Collector
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api Training
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System

What's new in Apache Hive

  • 1. Hive What’s new and what’s next? Gunther Hagleitner Hortonworks @yakrobat Page 1
  • 2. STINGER Initiative SPEEED!! POWERRR!! – Jeremy Clarkson Page 2
  • 3. ROLLUP, CUBE (Hive 0.10) Rollup query: select state, year, sum(amt_paid) from sales group by state, year with rollup. Result (State, Year, Sum): CA 2011 20000; CA 2012 25000; CA * 45000; NY 2012 15000; NY * 15000; * * 60000. Cube query: select state, year, sum(amt_paid) from sales group by state, year with cube. Result (State, Year, Sum): the same six rows plus * 2011 20000; * 2012 40000. HIVE-3433 Page 3
  • 4. Support for Analytics • Simple analytical tasks can turn into unintuitive and inefficient queries select count(*) as rk, s2.state as state, s2.product as product, avg(s2.amt_paid), sum(s1.amt_paid) from sales s1 join sales s2 on (s1.product = s2.product and s1.state = s2.state) where s1.year <= s2.year group by s2.state, s2.product, s2.year order by state, product, rk; Page 4
  • 5. Support for Analytics • Simple numbering + running total. Columns: Number, State, Product, Amount, Total. Rows: 1 CA A 1000 1000; 2 CA A 500 1500; 3 CA A 700 2200; 4 CA A 300 2500; 1 CA B 500 500; 2 CA B 500 1000 Page 5
  • 6. Support for Analytics • Faster, but still not very intuitive select state, product, amt_paid, rsum(hash(state, product),amt_paid) from ( select state, product, amt_paid from sales distribute by hash(state,product) sort by state, product ) t; Page 6
  • 7. Support for Analytics – OVER clause • Now that’s more like it select rank() over state_and_product, state, product, amt_paid, sum(amt_paid) over state_and_product from sales window state_and_product as (partition by state, product order by year); Page 7
  • 8. Support for Analytics – OVER clause. Example rows (columns labeled partition by, order by, rows): AL 2012 1000.00; CA 2010 2000.00; CA 2011 2000.00; CA 2012 4000.00; CA 2013 1000.00; NY 2012 500.00 • OVER clause – PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING – Works with current aggregate functions – New aggregates/window functions – RANK, ROW_NUMBER, LAG, LEAD, FIRST_VALUE, LAST_VALUE – NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC HIVE-896 Page 8
  • 9. Support for Analytics Continued • Sub-queries in WHERE – Non-correlated only – [NOT] IN supported – Plan to optimize to fit in memory as hash table when feasible, join when not • Standard SQL data types – datetime – char() and varchar() – add precision and scale to decimal and float – aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal) Page 9
  • 10. Automatic join conversion [Diagram labels: Sorted?; Sort Merge Bucket Join] • When enabled Hive will automatically pick join implementation • Query hints no longer needed • Can be configured to run without conditional tasks HIVE-3784 Page 10
  • 11. Merging join tasks select … from sales join date_dim on (…) join time_dim on (…) join state on (…) [Diagram: before, three tasks with one Mapjoin each; after, one task running Mapjoin -> Mapjoin -> Mapjoin] • Used to generate sequence of map-only jobs • Hive will now do as many map-joins as fit in memory in single map-only job • Memory limit is configurable • Memory size is estimated from file size HIVE-3784 Page 11
  • 12. M-MR to MR select sum(…) … from sales join date_dim on (…) group by … [Diagram: before, Map Task (Mapjoin), then Map Task (Mapside Group by/Aggr) and Reduce Task (Aggr); after, Map Task (Mapjoin -> Mapside Group by/Aggr) and Reduce Task (Aggr)] • Used to run as map-only job followed by a map-reduce job • Hive will now merge the two map tasks HIVE-3952 Page 12
  • 13. Group by/Order by (ReduceSinkDeDup) select … from sales group by store, item order by store [Diagram: before, Map Task (Map-side/Aggr), Reduce Task (Group by/Aggr), Map Task (Noop), Reduce Task (Noop); after, Map Task (Map-side/Aggr), Reduce Task (Group by/Aggr)] • Used to generate map-reduce job for group by followed by map-reduce job for order by • Hive will now do both in same job • More general: Will search for reduce sinks on same keys and combine • Caution: Might degrade performance if difference in num reducers is big HIVE-2340 Page 13
  • 14. Upcoming: Limit pushdown select … from sales group by store, item order by store limit 20 [Diagram: before, Map Task (Map-side/Aggr), Reduce Task (Group by/Aggr+Limit); after, Map Task (Map-side/Aggr +Top-k), Reduce Task (Group by/Aggr+Limit)] • Used to output all pre-aggregated data from map task and limit the output in the reducer • Hive will keep a top-k list of elements in each map task, reducing the amount of data to be shuffled HIVE-3562 Page 14
  • 15. Upcoming: Total order sort • Order by queries no longer result in single reducer • Makes it easier to apply optimizations such as group by/order by • Requires knowledge of the key distribution (sampling) HIVE-1402 Page 15
  • 16. ORC – Optimized RCFile HIVE-3874 Page 16 © Hortonworks Inc. 2012
  • 17. File Layout Page 17 © Hortonworks Inc. 2012
  • 18. ORC-enabled improvements Page 18 © Hortonworks Inc. 2012
  • 19. Beyond Batch with YARN & Tez • Tez Generalizes Map-Reduce: Simplified execution plans process data more efficiently • Always-On Tez Service: Low latency processing for all Hadoop data processing Page 19
  • 20. Tez Service • Hive Query Startup Expensive – Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s) • Solution – Tez Service – Removes task-launch overhead – Removes job-launch overhead – Hive – Submit query-plan to Tez Service – Native Hadoop service, not ad-hoc Page 20
  • 21. Tez - Core Idea • Task with pluggable Input, Processor & Output [Diagram: a Tez Task made of Input, Processor, Output] • YARN ApplicationMaster to run DAG of Tez Tasks Page 21
  • 22. Hive/MR versus Hive/Tez select a.state, count(*) from a join b on (a.id = b.id) group by a.state [Diagram: Hive - MR plan with I/O Synchronization Barrier; Hive - Tez plan with I/O Pipelining] Page 22
  • 23. Hive/MR versus Hive/Tez select store, state, total from (select storeid, sum(sales_price) total from sales s join date_dim d on (s.dateid = d.dateid) where d.year = 2012 group by storeid ) ss join (select storeid, store, state from store join state on (store.stateid = state.stateid) ) sd on (sd.storeid = ss.storeid) Hive - MR Hive - Tez Page 23
  • 24. Hive Performance Longer Term - Caching • Need to be able to keep hot data sets in memory • Could be done via pinning files in OS buffer cache • Could be done with separate process running its own buffer cache • Need to evaluate best plan • Would like to pin dimension tables in memory • Latest partitions of large tables also a candidate • Ideally will include changes to the scheduler to understand which nodes have which partitions/tables cached Page 24
  • 25. Hive Performance Longer Term - Vectorization • Rewrite operators to work on arrays of Java scalars • MonetDB paper • Operates on blocks of 1K or more records • Each block contains an array of Java scalars, one for each column • Avoids many function calls • Size to fit in L1 cache, avoid cache misses • Generate code for operators on the fly to avoid branches in code, maximize deep pipelines of modern processors • Requires conversion of all column values to Java scalars – no objects allowed – Integrates nicely with ORC work, other input types will need conversion on reading • Want to write this in a way it can be shared by Pig, Cascading, MR programmers Page 25
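
The block-at-a-time idea on the vectorization slide can be sketched roughly as follows. This is an illustration in Python rather than Hive's Java code: the function names (`batches`, `filter_gt`, `sum_selected`, `sum_where`), the sample columns, and the batch size of 1024 are all assumptions made for the example, and `array` stands in for the "arrays of Java scalars" the slide describes. Each operator is called once per block instead of once per row.

```python
from array import array

BATCH_SIZE = 1024  # assumption: the slide only says "blocks of 1K or more records"


def batches(columns, batch_size=BATCH_SIZE):
    """Yield column-oriented blocks: one primitive array per column."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {name: col[start:start + batch_size] for name, col in columns.items()}


def filter_gt(batch, column, threshold):
    """Filter operator: one call per block, returns the selected row indices."""
    values = batch[column]
    return [i for i in range(len(values)) if values[i] > threshold]


def sum_selected(batch, column, selected):
    """Aggregate operator: sums one column over the selected rows of a block."""
    values = batch[column]
    total = 0.0
    for i in selected:
        total += values[i]
    return total


def sum_where(columns, sum_col, filter_col, threshold):
    """Roughly: select sum(sum_col) where filter_col > threshold, block by block."""
    total = 0.0
    for batch in batches(columns):
        selected = filter_gt(batch, filter_col, threshold)
        total += sum_selected(batch, sum_col, selected)
    return total


# Hypothetical sample data: 4000 rows stored as typed primitive arrays.
columns = {
    "year": array("i", [2011, 2012, 2012, 2013] * 1000),
    "amt_paid": array("d", [100.0, 200.0, 300.0, 400.0] * 1000),
}

result = sum_where(columns, "amt_paid", "year", 2011)
print(result)  # 900000.0
```

With 4000 rows and a batch size of 1024, each operator here is invoked four times rather than 4000 times, which is the "avoids many function calls" point from the slide. The sketch does not attempt the cache sizing or on-the-fly code generation the slide also mentions.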